Abstract

Boosting algorithms have been widely used to tackle a plethora of problems. In 
the last few years, a lot of approaches have been proposed to provide standard 
AdaBoost with cost-sensitive capabilities, each with a different focus. However, 
for the researcher, these algorithms shape a tangled set with diffuse differences 
and properties, lacking a unifying analysis to jointly compare, classify, evaluate 
and discuss those approaches on a common basis. In this series of two papers we 
aim to revisit the various proposals, both from theoretical (Part I) and practical 
(Part II) perspectives, in order to analyze their specific properties and behavior, 
with the final goal of identifying the algorithm providing the best and soundest 
results. 
1. Introduction

The classical approach to solving a classification problem is based on the use
of a single expert that must be able to build a solution classifier. However, in
the last few decades, a new classification paradigm, based on the combination
of several experts in a distributed decision process, has arisen and attracted the
attention of the Machine Learning community. The success of this paradigm
rests on several theoretical, practical and even biological reasons (such as
generalization properties, complexity, data handling, data source fusion, etc.)
that make these Ensemble Classifiers preferable to classical ones in many
scenarios.

One of the milestones in the history of ensemble methods was the work
published by Robert E. Schapire in 1990 [2], in which the author proves the
equivalence between weak learners (algorithms able to generate classifiers
performing only slightly better than random guessing) and strong learners (those
generating classifiers which are correct in all but an arbitrarily small fraction
of the instances). This new model of learnability, in which weak learners can be
boosted to achieve strong performance when they are properly combined, paved
the way to one of the most prominent families of algorithms within the ensemble
classifiers paradigm: boosting.

In 1997, Yoav Freund and Robert E. Schapire [3] proposed a more general
boosting algorithm called AdaBoost (from Adaptive Boosting). Unlike previous
approaches, AdaBoost does not require any prior knowledge of the weak
hypothesis space, and it iteratively adjusts to the weak hypotheses that become
part of the ensemble. Apart from theoretical guarantees and practical advantages
over its predecessors, early experiments on AdaBoost also showed a surprising
resistance to overfitting. As a consequence of all these qualities, AdaBoost has
received attention “rarely matched in computational intelligence” [1], remaining
an active research topic in the fields of machine learning, pattern recognition
and computer vision to the present day.

Throughout this time, several studies have been conducted to analyze AdaBoost
from different points of view, relating the algorithm to different theories:
margin theory [3], entropy, game theory, statistics [7], etc. In the
same way, numerous AdaBoost and boosting variants have been proposed for the
two-class and multiclass problems: Real AdaBoost, LogitBoost [7], Gentle
AdaBoost [7], AsymBoost, AdaCost, AdaBoost.M1, AdaBoost.M2, AdaBoost.MH [3],
AdaBoost.MO [5], AdaBoost.MR, JointBoosting, AdaBoost.ECC, etc.

Among the different kinds of classification problems, one common subset is
that of tasks with clearly different costs depending on each possible decision,
or scenarios with very unbalanced class priors in which one class is far
more frequent or easier to sample than the other one. In such cost-sensitive or
asymmetric conditions (disaster prediction, fraud detection, medical diagnosis,
object detection, etc.) classifiers must be able to focus their attention on the
rare/most valuable class. Many works in the literature have been devoted to
cost-sensitive learning, including a significant set of proposals on
how to provide AdaBoost with asymmetric properties (e.g. [23]). The link
between AdaBoost and Cost-Sensitive learning is of special interest since
AdaBoost is the learning algorithm inside the widespread Viola-Jones object
detector framework [3], a seminal work in computer vision dealing with a
markedly asymmetric problem and an enormous number of weak classifiers (on the
order of hundreds of thousands).

The different asymmetric AdaBoost variants proposed in the literature are
very heterogeneous, and their related works focus on emphasizing the
possible advantages of each respective method, rather than building a common
framework to jointly classify, analyze and discuss the different approaches. The
final result is that, for the researcher, these algorithms form a confusing set
with no clear theoretical properties to guide their application in practical
problems.

In this series of two papers we try to classify, analyze, compare and discuss
the different proposals on Cost-Sensitive AdaBoost algorithms, in order to gain
a unifying perspective. Our final goal is to find a definitive scheme to directly
translate any cost-sensitive learning problem to the AdaBoost framework and
to shed light on which algorithm can ensure the best performance.

The current article focuses on the theoretical part of our work and is
organized as follows: Section 2 focuses on standard AdaBoost and its
related theoretical framework. Section 3 is devoted to clustering and
explaining, in a homogeneous notational framework, the different cost-sensitive
AdaBoost variants proposed in the literature, and in Section 4 we analyze in
depth those algorithms with a fully theoretical derivation scheme. Finally, we
present the preliminary conclusions, which will be completed in the accompanying
paper “Untangling AdaBoost-based Cost-Sensitive Classification.
Part II: Empirical Analysis”, covering the experimental part of our work.

2. AdaBoost 

Let us define $X$ as the random process from which our observations
$\mathbf{x} = (x_1, \ldots, x_n)^T$ are sampled, and $Y$ the random variable
governing the related labels $y \in \{-1, 1\}$. In this scenario, a detector
$H(\mathbf{x})$ (we will also refer to it as classifier or hypothesis) is a
function trying to guess the label $y$ of a given sample $\mathbf{x}$, and it can
be defined in terms of a more generic function $f(\mathbf{x}) \in \mathbb{R}$,
which we will call predictor.

$$H(\mathbf{x}) = \operatorname{sign}[f(\mathbf{x})]$$

Suppose we have a training set of $n$ examples $\mathbf{x}_i$ with their
respective labels $y_i$, a weight distribution $D(i)$ over them and a weak
learner able to select, according to labels and weights, the best detector
$h(\mathbf{x})$ from a predefined collection of weak classifiers. In this
scenario, the role of AdaBoost is to compute a goodness measure $\alpha$
depending on the performance obtained by the selected weak classifier, and to
update the weight distribution accordingly to emphasize misclassified training
examples. Then, with a different weight distribution, the weak learner can make
a new hypothesis selection and the process restarts. By iteratively repeating
this scheme (equations (1) to (4)), with $t$ indexing the number of learning
rounds, AdaBoost obtains an ensemble of weak classifiers with respective
goodness parameters $\alpha_t$.



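$$\alpha_t = \frac{1}{2} \log\left(\frac{1 + r_t}{1 - r_t}\right) \tag{1}$$

$$r_t = \sum_{i=1}^{n} D_t(i)\, y_i h_t(\mathbf{x}_i) \tag{2}$$

$$D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\right)}{Z_t} \tag{3}$$

$$Z_t = \sum_{i=1}^{n} D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\right) \tag{4}$$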

The weak hypothesis search in AdaBoost is guided to maximize the goodness
$\alpha_t$ of each selected classifier, which is equivalent to maximizing, at
each iteration, the weighted correlation $r_t$ (2) between labels $y_i$ and
predictions $h_t$. This iterative search can continue until a predefined number
$T$ of training rounds has been completed or some performance goal is reached.
The final AdaBoost strong detector $H(\mathbf{x})$ is defined (5) in terms of a
boosted predictor $f(\mathbf{x})$ built as an ensemble of the selected weak
classifiers weighted by their respective goodness parameters $\alpha_t$.


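$$H(\mathbf{x}) = \operatorname{sign}\left(f(\mathbf{x})\right) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right) \tag{5}$$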


2.1. Error Bound Minimization 

Starting from the original derivation of AdaBoost, Robert E. Schapire and
Yoram Singer proposed [5] a generalised and simplified analysis that models the
algorithm as an additive (round-by-round) minimization process of an exponential
bound on the strong classifier training error ($E_T$). This bounding process is
explained in equation (6), from which all the AdaBoost equations we have
presented, weight update rule included, can be derived.


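$$H(\mathbf{x}_i) \neq y_i \;\Rightarrow\; y_i f(\mathbf{x}_i) \leq 0 \;\Rightarrow\; \exp\left(-y_i f(\mathbf{x}_i)\right) \geq 1$$

$$E_T = \sum_{i=1}^{n} D_1(i)\, [\![ H(\mathbf{x}_i) \neq y_i ]\!] \leq \sum_{i=1}^{n} D_1(i) \exp\left(-y_i f(\mathbf{x}_i)\right) \tag{6}$$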

After (6), the final bound of the training error obtained by AdaBoost can be
expressed as (7), and the additive minimization of the exponential bound
$\tilde{E}_T$ can be seen as finding, in each round, the weak hypothesis $h_t$
that maximizes $r_t$, the weighted correlation between labels ($y_i$) and
predictions ($h_t$).


$$E_T \leq \prod_{t=1}^{T} Z_t \leq \prod_{t=1}^{T} \sqrt{1 - r_t^2} = \tilde{E}_T \tag{7}$$

Notation: $[\![ a ]\!]$ is 1 when $a$ is true and 0 otherwise.

When weak hypotheses are binary, $h_t \in \{-1, +1\}$, the last inequality in (7)
becomes an equality, and parameter $\alpha_t$ can be directly rewritten (8) in
terms of the weighted error $\epsilon_t$ of the current weak classifier. As can
be seen, the minimization process turns out to be equivalent to simply selecting
the weak classifier with the lowest weighted error.


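$$\epsilon_t = \sum_{i=1}^{n} D_t(i)\, [\![ h_t(\mathbf{x}_i) \neq y_i ]\!] = \sum_{\text{err}} D_t(i), \qquad \alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \tag{8}$$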
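The step from the correlation form to the weighted-error form can be made explicit with a short worked derivation. It assumes the standard Schapire and Singer choice $\alpha_t = \frac{1}{2}\log\frac{1+r_t}{1-r_t}$ and binary weak hypotheses, so that $y_i h_t(\mathbf{x}_i)$ is $+1$ on correct examples and $-1$ on errors:

```latex
\begin{align*}
r_t &= \sum_{i=1}^{n} D_t(i)\, y_i h_t(\mathbf{x}_i)
     = (1-\epsilon_t) - \epsilon_t = 1 - 2\epsilon_t, \\
\alpha_t &= \frac{1}{2}\log\frac{1+r_t}{1-r_t}
          = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}, \\
Z_t &= (1-\epsilon_t)\,e^{-\alpha_t} + \epsilon_t\,e^{\alpha_t}
     = 2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1-r_t^2}.
\end{align*}
```

Since $2\sqrt{\epsilon_t(1-\epsilon_t)}$ grows with $\epsilon_t$ on $[0, 1/2]$, the round factor $Z_t$ is smallest for the weak classifier with the lowest weighted error, which is the selection rule stated above.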


In line with other works, for the sake of simplicity and clarity, we will focus
our analysis on this Discrete version of AdaBoost using binary weak classifiers,
which does not prevent our conclusions from being extended to other variations
of the algorithm. Also, in order to define a homogeneous notational framework
for our work, we have unified the different notations found in the literature to
that used by Schapire and Singer [5]. A summary of AdaBoost, like all the
algorithms discussed in this paper, is detailed with homogeneous notation in
Appendix A.
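As an illustration of the Discrete AdaBoost loop described in this section, the following is a minimal sketch rather than the algorithm listing from the appendix. It assumes NumPy, uses exhaustive threshold stumps as the weak learner (an assumption made only for this example), and the function names `fit_stump`, `stump_predict`, `adaboost_fit` and `adaboost_predict` are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Binary weak classifier: +1 when pol * (x_j - thr) >= 0, else -1."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, D):
    """Weak learner: exhaustive search for the stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = D[stump_predict(X, j, thr, pol) != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost_fit(X, y, T=10):
    """Discrete AdaBoost with a uniform initial weight distribution."""
    n = len(y)
    D = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = fit_stump(X, y, D)
        err = np.clip(err, 1e-12, 1 - 1e-12)       # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)      # goodness, weighted-error form (8)
        h = stump_predict(X, j, thr, pol)
        D = D * np.exp(-alpha * y * h)             # weight update numerator (3)
        D /= D.sum()                               # division by Z_t (4)
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    """Strong detector (5): sign of the alpha-weighted vote."""
    f = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.where(f >= 0, 1, -1)
```

The clipping of the weighted error is a numerical convenience of this sketch, not part of the algorithm; ties at $f(\mathbf{x}) = 0$ are resolved as positive here by convention.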

2.2. Statistical View of Boosting 

One of the milestones in boosting research and the foundation of many
variations of AdaBoost is the highly-cited contribution by Jerome H. Friedman et
al. [7], in which a statistical reinterpretation of boosting is given. Following
the exponential criterion seen in the last subsection, Friedman et al. showed
that AdaBoost can be motivated as an iterative algorithm building an additive
logistic regression model $f(\mathbf{x})$ that minimizes the expectation of the
exponential loss, $J(f(\mathbf{x}))$:

$$J(f(\mathbf{x})) = E\left[\exp\left(-y f(\mathbf{x})\right)\right] \tag{9}$$


This loss is effectively minimized at

$$f(\mathbf{x}) = \frac{1}{2} \log\left(\frac{P(y = 1 \mid \mathbf{x})}{P(y = -1 \mid \mathbf{x})}\right)$$


so a direct connection between boosting and additive logistic regression models
is drawn. According to this statistical perspective, AdaBoost predictions can
be seen as estimates of the posterior class probabilities, which has served as
a basis to develop many extensions and variants of the algorithm (among them,
the Cost-Sensitive Boosting scheme).
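The minimizer can be verified pointwise. Conditioning the exponential loss on $\mathbf{x}$ and setting its derivative with respect to $f(\mathbf{x})$ to zero gives the half log-odds (a standard calculation, written here with the shorthand $f = f(\mathbf{x})$):

```latex
\begin{align*}
E\left[e^{-y f} \mid \mathbf{x}\right]
  &= P(y=1 \mid \mathbf{x})\, e^{-f} + P(y=-1 \mid \mathbf{x})\, e^{f}, \\
\frac{\partial}{\partial f} E\left[e^{-y f} \mid \mathbf{x}\right]
  &= -P(y=1 \mid \mathbf{x})\, e^{-f} + P(y=-1 \mid \mathbf{x})\, e^{f} = 0 \\
\Rightarrow\quad e^{2f} &= \frac{P(y=1 \mid \mathbf{x})}{P(y=-1 \mid \mathbf{x})}
\quad\Rightarrow\quad
f = \frac{1}{2}\log\frac{P(y=1 \mid \mathbf{x})}{P(y=-1 \mid \mathbf{x})}.
\end{align*}
```

The conditional loss is convex in $f$, so this stationary point is the minimum.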

It is important to mention that, despite the huge and unquestionable value of
the statistical view, some enriching controversy, prompted by empirical evidence,
has arisen about inconsistencies of this interpretation.


3. Cost-Sensitive Variants of AdaBoost 

Cost-sensitive classification problems can be fully portrayed by a cost
matrix whose components map the loss of each possible result. For two-class
problems there are four kinds of results: true positives, true negatives, false
positives and false negatives; so the cost matrix $C$ can be defined as follows:

$$
C = \begin{array}{c|cc}
 & \text{Actual negative} & \text{Actual positive} \\ \hline
\text{Classified negative} & c_{nn} & c_{np} \\
\text{Classified positive} & c_{pn} & c_{pp}
\end{array}
$$

The optimal decision for a given cost matrix will not change if a constant is
added to all its coefficients, or if they are multiplied by a constant positive
factor. As a result, a cost matrix for two-class classification problems only has
two degrees of freedom and can be parametrized by only two coefficients: the
normalized cost of false negatives ($c_{np}$) and the normalized cost of true
positives ($c_{pp}$).

In the most common case correct decisions have null related costs
($c_{nn} = c_{pp} = 0$), so $C$ ultimately has only one degree of freedom: the
ratio between the cost of errors on positives ($c_{np}$) and the cost of errors
on negatives ($c_{pn}$). In the literature and most practical problems, cost
requirements are usually specified by these two error parameters, which, for
simplicity, we will denote as $C_P$ and $C_N$ respectively.

$$C = \begin{pmatrix} 0 & C_P / C_N \\ 1 & 0 \end{pmatrix} \equiv \begin{pmatrix} 0 & C_P \\ C_N & 0 \end{pmatrix}$$

The coefficients of a cost matrix may not be constant in general. While
constant coefficients model a scenario where all the examples of each class have
the same cost (class-level asymmetry), variable coefficients mean that examples
belonging to the same class can have different costs (example-level asymmetry).
Whatever the scenario, it is also important to notice that, for “reasonableness”,
correct predictions in a cost matrix should have lower associated costs than
mistaken ones ($c_{nn} < c_{np}$ and $c_{pp} < c_{pn}$).

Bearing in mind that class-level asymmetry is the most common for detection
problems, and that example-level asymmetry can be modeled by a class-level
asymmetry scheme with a resampled training dataset, for our analysis we have
homogenized the different Asymmetric AdaBoost approaches to the class-level
scheme. Thus, we will follow a prototypical cost-sensitive detection statement
specified by two constant coefficients $C_P$ and $C_N$, which can alternatively
be described by the “normalized cost asymmetry” of the problem
$\gamma \in (0, 1)$:

$$\gamma = \frac{C_P}{C_P + C_N}$$

Despite the widespread use of these particularizations, in the Appendix we
will extend our conclusions to example-level asymmetry and also to cases in
which correct classification costs are nonzero.


It is also important to emphasize that this work is focused on AdaBoost and
its cost-sensitive variants, a realm of methods in the literature that are based on
an exponential loss minimization criterion, analogous to that giving rise to the
original algorithm (as we have seen in Sections 2.1 and 2.2 from different points
of view). Other boosting algorithms based on other kinds of losses beyond the
exponential paradigm, like the binomial log-likelihood [7] or the p-norm loss,
are outside the scope of the current study.


3.1. Classification 

In order to give a clear overview of the cost-sensitive variants of AdaBoost 
proposed in the literature, we suggest an analytical classification scheme to 
cluster them into three categories according to the way asymmetry is reached: 
A posteriori, Heuristic and Theoretical. 

3.1.1. A Posteriori 

The seminal face/object detector framework by Paul Viola and Michael J.
Jones [5] uses a validation set to modify, after training, the threshold of the
original (cost-insensitive) AdaBoost strong classifier. The goal is to adjust the
balance between false positive and detection rates, building, that way, a
cost-sensitive boosted classifier:

$$H(\mathbf{x}) = \operatorname{sign}\left(f(\mathbf{x}) - \phi\right) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}) - \phi\right)$$

Despite the great success of the detection framework, the authors themselves
acknowledge that this a posteriori cost-sensitive tuning neither ensures that
the selected weak classifiers are optimal for the asymmetric goal nor preserves
the original AdaBoost training and generalization guarantees.

A useful insight on this can be drawn from the analysis by Masnadi-Shirazi
and Vasconcelos [10]. According to the Bayes Decision Rule, the optimal
predictor $f^*(\mathbf{x})$ can be expressed in terms of the optimal predictor
for a cost-insensitive scenario $f_0^*(\mathbf{x})$ and a threshold $\phi$
depending on costs.

$$f^*(\mathbf{x}) = \log\left(\frac{P_{Y|X}(1 \mid \mathbf{x})\, C_P}{P_{Y|X}(-1 \mid \mathbf{x})\, C_N}\right) = \log\left(\frac{P_{Y|X}(1 \mid \mathbf{x})}{P_{Y|X}(-1 \mid \mathbf{x})}\right) - \log\left(\frac{C_N}{C_P}\right) = f_0^*(\mathbf{x}) - \phi$$


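As a consequence, for any cost requirements, the optimal cost-sensitive
detector $H^*(\mathbf{x})$ can also be expressed as a threshold on the
cost-insensitive optimal predictor $f_0^*(\mathbf{x})$.

$$H^*(\mathbf{x}) = \operatorname{sign}\left[f^*(\mathbf{x})\right] = \operatorname{sign}\left[f_0^*(\mathbf{x}) - \phi\right]$$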
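The equivalence between the cost-weighted Bayes rule and a threshold on the cost-insensitive log-odds can be checked numerically. The sketch below assumes the true posterior $P_{Y|X}(1 \mid \mathbf{x})$ is known, which no learning algorithm has in practice, and takes $\phi = \log(C_N / C_P)$, the value obtained by splitting the logarithm of the cost-weighted ratio. Both function names are hypothetical.

```python
import math

def cost_sensitive_bayes_label(p_pos, c_p, c_n):
    """Bayes rule with costs: +1 when P(1|x) * C_P exceeds P(-1|x) * C_N."""
    return 1 if p_pos * c_p > (1.0 - p_pos) * c_n else -1

def thresholded_label(p_pos, c_p, c_n):
    """Same decision written as sign(f0*(x) - phi)."""
    f0 = math.log(p_pos / (1.0 - p_pos))  # cost-insensitive optimal predictor (log-odds)
    phi = math.log(c_n / c_p)             # cost-dependent threshold
    return 1 if f0 - phi > 0 else -1
```

With $C_P = 5$ and $C_N = 1$, for instance, a sample whose positive posterior is only 0.3 is still labeled positive by both rules, because missing a positive is five times as costly as a false alarm.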


In practical terms, however, learning algorithms do not have access to the
exact probability distributions and they must approximate this optimal detector
rule. Thus, AdaBoost can be seen as an algorithm obtaining an approximation
($\hat{H}_0(\mathbf{x})$) to the optimal cost-insensitive detector, built by
means of an estimate ($\hat{f}_0(\mathbf{x})$) of the cost-insensitive
predictor [10].

$$\hat{H}_0(\mathbf{x}) = \operatorname{sign}\left[\hat{f}_0(\mathbf{x})\right] = \operatorname{sign}\left[\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right] \approx H_0^*(\mathbf{x}) \tag{10}$$

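By definition, the purpose of AdaBoost is to obtain a detector as close as
possible to the optimal one, and this optimality is ensured if the learned
predictor satisfies two necessary and sufficient conditions:

$$
\hat{H}_0(\mathbf{x}) = H_0^*(\mathbf{x}) \iff
\begin{cases}
\hat{f}_0(\mathbf{x}) = f_0^*(\mathbf{x}) = 0 & \text{if } P_{Y|X}(1 \mid \mathbf{x}) = P_{Y|X}(-1 \mid \mathbf{x}) \\
\operatorname{sign}\left[\hat{f}_0(\mathbf{x})\right] = \operatorname{sign}\left[f_0^*(\mathbf{x})\right] & \text{if } P_{Y|X}(1 \mid \mathbf{x}) \neq P_{Y|X}(-1 \mid \mathbf{x})
\end{cases}
\tag{11}
$$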

As can be seen, in order to reach optimal detection the predictor learned by
AdaBoost should match the optimal predictor in the boundary region, but only
its sign elsewhere. Analogously, optimal detection for the cost-sensitive case
would be ensured by two equivalent conditions:

$$
\hat{H}(\mathbf{x}) = H^*(\mathbf{x}) \iff
\begin{cases}
\hat{f}(\mathbf{x}) = f^*(\mathbf{x}) = 0 & \text{if } P_{Y|X}(1 \mid \mathbf{x})\, C_P = P_{Y|X}(-1 \mid \mathbf{x})\, C_N \\
\operatorname{sign}\left[\hat{f}(\mathbf{x})\right] = \operatorname{sign}\left[f^*(\mathbf{x})\right] & \text{if } P_{Y|X}(1 \mid \mathbf{x})\, C_P \neq P_{Y|X}(-1 \mid \mathbf{x})\, C_N
\end{cases}
\tag{12}
$$

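Thus, optimality conditions required by the a posteriori modification of the
AdaBoost threshold would be as follows:

$$
\hat{H}(\mathbf{x}) = H^*(\mathbf{x}) \iff
\begin{cases}
\hat{f}_0(\mathbf{x}) = f_0^*(\mathbf{x}) = \phi & \text{if } P_{Y|X}(1 \mid \mathbf{x})\, C_P = P_{Y|X}(-1 \mid \mathbf{x})\, C_N \\
\operatorname{sign}\left[\hat{f}_0(\mathbf{x}) - \phi\right] = \operatorname{sign}\left[f_0^*(\mathbf{x}) - \phi\right] & \text{if } P_{Y|X}(1 \mid \mathbf{x})\, C_P \neq P_{Y|X}(-1 \mid \mathbf{x})\, C_N
\end{cases}
$$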


Bearing in mind that the AdaBoost predictor $\hat{f}_0(\mathbf{x})$ is geared to
satisfy (11), the optimality conditions for threshold modification are not
necessarily fulfilled. The only way to meet these requirements for any cost
would be for the predictor obtained by AdaBoost to match the optimal one along
the whole space, which is an obviously stronger condition than actually
required (12). Moreover, recalling the exponential bounding equation on which
AdaBoost is based (6), we can see that, once the sign of the obtained predictor
matches the right label, the error bound is further minimized for increasing
absolute values of the estimated predictor, no matter how close they are (or
not) to the optimal predictor value.

As a consequence, there are no guarantees that a threshold change on the
classical AdaBoost predictor will give us a cost-sensitive detector oriented to be
optimal. Nonetheless, this non-optimality has not prevented asymmetric
detectors obtained by the Viola-Jones framework from being very successfully
used for object detection.

3.1.2. Heuristic

Most of the proposed cost-sensitive variations of AdaBoost try
to deal with asymmetry through direct manipulations of the weight update rule
(3), but they are not full reformulations of AdaBoost for cost-sensitive
scenarios. Masnadi-Shirazi and Vasconcelos pointed out that this kind of
manipulation “provides no guarantees of asymptotic convergence to a good
cost-sensitive decision rule”, considering those algorithms as “heuristic”
modifications of AdaBoost.

Although these proposals have, to a greater or lesser extent, some theoretical
basis, for the sake of clarity and distinctiveness in our analysis, we will maintain
the term heuristic, as used by Masnadi-Shirazi and Vasconcelos, to label this
group of approaches based on the arbitrary modification of the weight update
rule, as opposed to the full theoretical derivations we will delve into in the
next subsection.


AsymBoost 

Assuming the non-optimality of the strong classifier threshold adjustment
procedure in their object detector framework (Section 3.1.1), Paul Viola and
Michael J. Jones proposed a different scheme, coined AsymBoost, trying
to optimize AdaBoost for cost-sensitive classification problems.

Dismissing asymmetric weight initialization as “naive” and only “somewhat
effective” due to “AdaBoost’s balanced reweighting scheme” (we will discuss
this point in Section 4.1), AsymBoost proposes to spread the emphasis on the
weights through an asymmetric modulation before each round. In practical terms,
the only change is multiplying weights $D(i)$ by a constant factor
$(C_P / C_N)^{y_i / 2T}$ before every learning step of a $T$-round process. As a
consequence, the overall asymmetric factor seen by positive elements across the
whole process is $C_P / C_N$ times the factor seen by negatives.




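$$D_{t+1}(i) = \frac{\left(C_P / C_N\right)^{y_i / 2T} D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\right)}{\sum_{i=1}^{n} \left(C_P / C_N\right)^{y_i / 2T} D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\right)}$$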
AsymBoost, which reduces to AdaBoost when costs are uniform, is detailed
in Appendix A.
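The per-round modulation can be sketched as follows. This is an illustration under a stated assumption: the per-round factor is taken to be $(C_P / C_N)^{y_i / 2T}$, a form chosen because it makes positives accumulate exactly $C_P / C_N$ times the emphasis of negatives over $T$ rounds, as the text requires. The function name `asymboost_modulate` is hypothetical and NumPy is assumed.

```python
import numpy as np

def asymboost_modulate(D, y, c_p, c_n, T):
    """Apply one round of asymmetric emphasis and renormalize the weights.

    Assumed factor: (C_P / C_N) ** (y_i / (2 T)), i.e. positives are scaled up
    and negatives scaled down by the same amount each round.
    """
    factor = (c_p / c_n) ** (y / (2.0 * T))
    D = D * factor
    return D / D.sum()
```

In a full AsymBoost loop this modulation would be combined with the usual exponential weight update in every round; applied on its own $T$ times to uniform weights, it leaves each positive with $C_P / C_N$ times the weight of each negative. Because the factor depends on $T$, changing the number of rounds changes every round's modulation.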

Though the global AsymBoost procedure seems to be theoretically sound,
the equitable sharing of asymmetry among a fixed number of rounds entails
significant problems: Why should such a rigid equitable sharing procedure be
optimal when inserted in an adaptive framework such as AdaBoost? Why should we
have to know the number of training rounds in advance when standard AdaBoost
does not require it? Note that standard AdaBoost allows flexible performance
tests to decide when to stop training, since any change in the total number of
rounds is directly performed by training new additional rounds or trimming the
current ensemble. However, a change in the size of the final ensemble (number
of rounds) would strictly require AsymBoost to re-train the whole classifier with
a new asymmetry distribution.


AdaCost 

Wei Fan et al. proposed a cost-sensitive variation of AdaBoost called
AdaCost. The idea behind AdaCost is to modify the weight update rule, so that
examples with higher costs have sharper increases of their weights after
misclassification but lighter decreases when they are successfully classified.
This scheme is essentially addressed by introducing a misclassification
adjustment function $\beta(i)$ into the weight update rule (13).


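$$D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\, \beta(i)\right)}{\sum_{i=1}^{n} D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\, \beta(i)\right)} \tag{13}$$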


The misclassification adjustment function must depend on the cost ($C(i)$) of
each example/class and the success/failure of its classification. As a result,
$\beta(i)$ is required to be non-decreasing with respect to $C(i)$ when
classification fails, and non-increasing when classification succeeds. This
opens the door to a huge number of functions satisfying such requirements, from
which the authors chose the following:

$$
\beta(i) =
\begin{cases}
0.5\left(1 - C(i)\right) & \text{if } h_t(\mathbf{x}_i) = y_i, \\
0.5\left(1 + C(i)\right) & \text{if } h_t(\mathbf{x}_i) \neq y_i.
\end{cases}
$$




As can be seen, AdaCost does not reduce to AdaBoost for uniform costs
and also applies a cost-dependent weight pre-emphasis (see Appendix A).
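A minimal sketch of the AdaCost weight update (13) with the adjustment function above may make the mismatch with AdaBoost concrete. It assumes NumPy, costs $C(i)$ given as an array with values in $[0, 1]$ (an assumption of this example), and covers only the update step, not the cost-dependent pre-emphasis or the computation of $\alpha_t$. The names `adacost_beta` and `adacost_update` are hypothetical.

```python
import numpy as np

def adacost_beta(correct, cost):
    """Adjustment function: 0.5 * (1 - C) on hits, 0.5 * (1 + C) on misses."""
    return np.where(correct, 0.5 * (1.0 - cost), 0.5 * (1.0 + cost))

def adacost_update(D, y, h, alpha, cost):
    """One AdaCost weight update followed by normalization."""
    beta = adacost_beta(h == y, cost)
    D_new = D * np.exp(-alpha * y * h * beta)
    return D_new / D_new.sum()
```

With $C(i) = 1$ for every example, correctly classified examples get $\beta(i) = 0$ and keep their weight before normalization, whereas standard AdaBoost would scale them by $e^{-\alpha_t}$; this is one way to see that uniform costs do not recover AdaBoost.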


CSB0, CSB1 and CSB2

Following the same idea of modifying the weight update rule, the CSB
(acronym from Cost-Sensitive Boosting) family of algorithms proposes
three different updating schemes depending on which parameters are involved,
resulting in the CSB0, CSB1 and CSB2 algorithms (see Appendix A). These rules
are complemented, for all three alternatives, by an asymmetric weight
initialization and a minimum expected cost criterion for strong
classification replacing the usual weighted voting scheme:

$$H(\mathbf{x}) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}) \left(C_P\, [\![ h_t(\mathbf{x}) = +1 ]\!] + C_N\, [\![ h_t(\mathbf{x}) = -1 ]\!]\right)\right)$$

This new voting rule gives emphasis, at run time, to weak hypotheses deciding
in favor of the costly class. Of the three alternatives, only the last one, CSB2,
reduces to standard AdaBoost when costs are equal.

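AdaC1, AdaC2 and AdaC3

Defining new ways to modify the weight update rule, Yanmin Sun et al.
[35] proposed another family of asymmetric AdaBoost alternatives called AdaC1,
AdaC2 and AdaC3. These variants couple the cost factor in different parts of the
update equation: inside the exponent (AdaC1), outside the exponent (AdaC2)
and both (AdaC3):

$$\text{AdaC1:} \quad D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t C_i y_i h_t(\mathbf{x}_i)\right)}{\sum_{i=1}^{n} D_t(i) \exp\left(-\alpha_t C_i y_i h_t(\mathbf{x}_i)\right)}$$

$$\text{AdaC2:} \quad D_{t+1}(i) = \frac{C_i D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\right)}{\sum_{i=1}^{n} C_i D_t(i) \exp\left(-\alpha_t y_i h_t(\mathbf{x}_i)\right)}$$

$$\text{AdaC3:} \quad D_{t+1}(i) = \frac{C_i D_t(i) \exp\left(-\alpha_t C_i y_i h_t(\mathbf{x}_i)\right)}{\sum_{i=1}^{n} C_i D_t(i) \exp\left(-\alpha_t C_i y_i h_t(\mathbf{x}_i)\right)}$$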

Unlike previous approaches, these changes in the weight update
are also propagated to the way the goodness parameter $\alpha_t$ is defined and,
as a consequence, influence how the weak classifier error is computed (see
Appendix A). All these variants reduce to AdaBoost when the cost
function $C(i)$ is 1 for all examples.
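The three update rules differ only in where the cost enters, which a single parametrized sketch can show. It assumes NumPy and covers the weight update alone; the variant-specific definitions of $\alpha_t$ mentioned above are not reproduced here. The function name `adac_update` and its `variant` argument are hypothetical.

```python
import numpy as np

def adac_update(D, y, h, alpha, cost, variant):
    """Weight update for AdaC1 (cost inside the exponent),
    AdaC2 (cost outside) and AdaC3 (both), followed by normalization."""
    if variant not in (1, 2, 3):
        raise ValueError("variant must be 1, 2 or 3")
    inside = cost if variant in (1, 3) else 1.0
    outside = cost if variant in (2, 3) else 1.0
    D_new = outside * D * np.exp(-alpha * inside * y * h)
    return D_new / D_new.sum()
```

Setting every cost to 1 makes both `inside` and `outside` neutral, so all three variants collapse to the standard AdaBoost update, in line with the reduction property stated above.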

3.1.3. Theoretical 

The methods in the previous subsection have one key point in common:
the starting point of their derivations is an arbitrary modification of the weight
update rule. However, as can be easily shown following the work by Schapire and
Singer [5], the weight update in standard AdaBoost is actually a consequence of
the error minimization procedure (6) and not an arbitrary starting point of it.
Thus, the way to reach theoretically sound cost-sensitive boosting algorithms
should be to walk the path in the opposite direction: designing a new asymmetric
derivation scheme to obtain a new full formulation (that may include a new
weight update rule), instead of partially adapting previous equations.

There are three alternatives in the literature that follow different
theoretically sound derivation schemes reaching cost-sensitive variants of
AdaBoost: Cost-Sensitive AdaBoost, AdaBoostDB and Cost-Generalized AdaBoost.

Cost-Sensitive AdaBoost 

The Cost-Sensitive Boosting framework proposed by Hamed Masnadi-Shirazi
and Nuno Vasconcelos has its roots in the Statistical View of Boosting
[7], adapting the standard loss in equation (9) with asymmetric exponential
arguments for each class component.

$$J(f(\mathbf{x})) = E\left[ [\![ y = 1 ]\!]\, e^{-C_P f(\mathbf{x})} + [\![ y = -1 ]\!]\, e^{C_N f(\mathbf{x})} \right] \tag{14}$$

This asymmetric loss is theoretically minimized by the asymmetric logistic
transform of $P(y = 1 \mid \mathbf{x})$ (see Section 2.2), which should ensure
cost-sensitive optimality.


$$f(\mathbf{x}) = \frac{1}{C_P + C_N} \log\left(\frac{C_P\, P(y = 1 \mid \mathbf{x})}{C_N\, P(y = -1 \mid \mathbf{x})}\right)$$
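The asymmetric minimizer follows from the same pointwise argument used for the symmetric exponential loss. Writing $f = f(\mathbf{x})$ and conditioning the asymmetric loss on $\mathbf{x}$:

```latex
\begin{align*}
E\left[\,\cdot \mid \mathbf{x}\right]
  &= P(y=1 \mid \mathbf{x})\, e^{-C_P f} + P(y=-1 \mid \mathbf{x})\, e^{C_N f}, \\
\frac{\partial}{\partial f}
  &= -C_P\, P(y=1 \mid \mathbf{x})\, e^{-C_P f}
     + C_N\, P(y=-1 \mid \mathbf{x})\, e^{C_N f} = 0 \\
\Rightarrow\quad e^{(C_P + C_N) f}
  &= \frac{C_P\, P(y=1 \mid \mathbf{x})}{C_N\, P(y=-1 \mid \mathbf{x})}
\quad\Rightarrow\quad
f = \frac{1}{C_P + C_N}
    \log\frac{C_P\, P(y=1 \mid \mathbf{x})}{C_N\, P(y=-1 \mid \mathbf{x})}.
\end{align*}
```

For $C_P = C_N = 1$ this reduces to the half log-odds of Section 2.2, and the sign of $f$ changes exactly where $C_P\, P(y=1 \mid \mathbf{x}) = C_N\, P(y=-1 \mid \mathbf{x})$, the cost-sensitive Bayes boundary.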


The empirical minimization of the asymmetric loss proposed by
Masnadi-Shirazi and Vasconcelos follows a gradient descent scheme on the space
of boosted (combined and modulated) binary weak classifiers, resulting in the
Cost-Sensitive AdaBoost algorithm shown in Appendix A. As can be seen,
the final solution involves hyperbolic functions and scalar search procedures,
being far more complex and computationally demanding than the original
AdaBoost.


AdaBoostDB 

Following the generalized analysis of AdaBoost [5] instead of the Statistical
View of Boosting, a different approach to provide AdaBoost with cost-sensitive
properties through a fully theoretical derivation procedure has also been
presented. This algorithm, coined AdaBoostDB (from Double Base), is based on
the use of different exponential bases $\beta_P$ and $\beta_N$ for each class
error component, thus defining a class-dependent error bound to minimize.


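$$\tilde{E}_T = \sum_{i=1}^{m} D_1(i)\, \beta_P^{-y_i f(\mathbf{x}_i)} + \sum_{i=m+1}^{n} D_1(i)\, \beta_N^{-y_i f(\mathbf{x}_i)}$$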


On the one hand, the derivation scheme followed and the polynomial model
used to address the problem enable a different and extremely efficient
formulation, able to save over 99% of training time with respect to
Cost-Sensitive AdaBoost (see Algorithm [11]). On the other hand, this
class-dependent error is fully equivalent to the cost-sensitive loss (14)
defined for Cost-Sensitive Boosting, so both minimizations will converge to the
same solution and ensure the same formal guarantees.

As a result, AdaBoostDB is a much more efficient framework to reach the
same solution as Cost-Sensitive Boosting (except for numerical errors related
to the different models adopted, hyperbolic vs. polynomial). However, despite
its large improvement in training complexity and performance, AdaBoostDB is
still much more complex than standard AdaBoost.

Cost-Generalized AdaBoost 

The Asymmetric AdaBoost problem has also been addressed from a different
theoretical perspective, realizing that one kind of modification has
systematically been either overlooked or undervalued in the related literature:
weight initialization.

Even though some preliminary studies by Freund and Schapire [3], creators
of AdaBoost, left the initial weight distribution free to be controlled by the
learner, AdaBoost is “de facto” defined, almost everywhere in the literature,
with a fixed initial uniform weight distribution. From there, some asymmetric
boosting algorithms (like AdaCost or CSB) use cost-sensitive initialization as a
lateral or secondary strategy with respect to their proposed weight update
rules, while others (like AsymBoost or Cost-Sensitive Boosting) immediately
dismiss asymmetric weight initialization as “naive” and ineffective, arguing
that the first boosting round would absorb the full introduced asymmetry and the
rest of the process would remain entirely symmetric.

In that work, following a different insight to analyze AdaBoost and obtaining a
novel error bound interpretation, asymmetric weight initialization is shown to
be an effective way to reach cost-sensitivity, and, as occurs with everything
related to boosting, it is achieved in an additive round-by-round (asymptotic)
way. All this comes with the added advantage that weight initialization is the
only change needed to gain asymmetry with regard to standard AdaBoost (even the
weight update rule is unchanged). Hence, for whatever desired asymmetry, both
the complexity and the formal guarantees of the original AdaBoost remain intact.
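An asymmetric initialization of this kind can be sketched in a few lines. The sketch assumes cost-proportional weighting, where each example starts with a weight proportional to the error cost of its class; this is one plausible reading and an assumption of this example, and the exact initialization used by Cost-Generalized AdaBoost may differ. The function name `cost_proportional_init` is hypothetical and NumPy is assumed.

```python
import numpy as np

def cost_proportional_init(y, c_p, c_n):
    """Initial weight distribution with D_1(i) proportional to C_P for
    positives (y = +1) and to C_N for negatives (y = -1).

    Assumption of this sketch: plain cost-proportional weighting.
    """
    y = np.asarray(y)
    raw = np.where(y == 1, c_p, c_n).astype(float)
    return raw / raw.sum()
```

The resulting distribution would simply replace the uniform $D_1$ of standard AdaBoost, with every later step (weak learner selection, $\alpha_t$, weight update) left untouched. With equal costs it returns the uniform distribution.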

In this work, we will refer to the algorithm underlying this perspective as
Cost-Generalized AdaBoost (see Appendix A).


4. Theoretical Algorithms: Analysis and Discussion 

Though in the experimental part of our work (see the accompanying paper
[55]) we will show comparative results of all the alternatives presented in the
previous section, at this point we will focus our attention on the three proposals
with a fully theoretical derivation scheme: Cost-Sensitive AdaBoost,
AdaBoostDB and Cost-Generalized AdaBoost. The first important aspect
we should notice is that these three proposals can be effectively analyzed as
if they were only two, since Cost-Sensitive AdaBoost and AdaBoostDB, despite
following different perspectives and obtaining markedly different algorithms,
share an equivalent theoretical root and lead to the same solution. As a
consequence, unless otherwise specified, in this section we will refer to one or
the other interchangeably, giving priority to the name Cost-Sensitive AdaBoost
due to its chronological precedence.

4.1. The Question of Weight Initialization

As commented in Section 3.1.3, despite some initial studies pointing to free
initial weight distributions [3] or works proposing cost-proportional weighting
as an effective way to transform generic cost-insensitive learning algorithms
into cost-sensitive ones [34], subsequent works on boosting have insisted on
two recurrent ideas: On the one hand, uniform distribution has been assumed
as the “de facto” standard for weight initialization when defining AdaBoost;
on the other hand, asymmetric weight initialization has been systematically
rejected as a valid method to achieve cost-sensitive boosted classifiers,
arguing that it is insufficient or ineffective.

However, in the work underlying Cost-Generalized AdaBoost, AdaBoost is 
demonstrated to have inherent and sound cost-sensitive properties embedded in 
the way the weight distribution is initialized. In fact, the method we are 
referring to as Cost-Generalized AdaBoost was not even originally proposed as a 
new algorithm: “it is just AdaBoost” with appropriate initial weights. Such an 
analysis, supported by a novel class-conditional interpretation of AdaBoost, is 
thus in clear contradiction with the supposed ineffectiveness of cost-sensitive 
weight initialization assumed in previous works. 

In order to settle this contradiction, we will connect both perspectives by 
demonstrating the validity of asymmetric weight initialization in the same 
scenarios and along the same lines of reasoning that have previously been used 
in the literature to reject it. 

4.1.1. The Supposed Symmetry 

Masnadi-Shirazi and Vasconcelos [10], when explaining their Cost-Sensitive 
Boosting framework, immediately discard unbalanced weight initialization 
(calling it a “naive implementation”) with the argument that the iterative weight 
update in AdaBoost “quickly destroys the initial asymmetry”, obtaining a 
“predictor” which “is usually not different from that produced with symmetric 
initial conditions”. Though their statement is not explicitly supported by any 
further test or bibliographic reference, it seems to be drawn from the work by 
Viola and Jones [13] in which AsymBoost is presented. In that work, the initial 
weight modification technique is rejected on the grounds that “the first 
classifier selected absorbs the entire effect of the initial asymmetric weights”, 
and the rest of the process is assumed to be “entirely symmetric”. It is because 
of this apparent problem that AsymBoost was designed to distribute the asymmetry 
equitably among a fixed number of rounds. 

The cost-sensitive analysis by Viola and Jones [13] is illustrated by a 
graphical representation of a four-round boosted classifier that supports their 
conclusions against asymmetric weight initialization. However, this example can 
be misleading: what would happen if boosting were run for more than those four 
rounds? An answer can be found in Figure 1, where we have reproduced and 
extended that illustrative experiment. 

Figure 1: Synthetic counterexample to the example by Viola and Jones [13], with costs $C_P = 4$ 
and $C_N = 1$, and the same polarity as the original: (a) training set with the first four weak 
classifiers superimposed; (b) weak classifiers after 50 training rounds; (c) global error evolution 
through 50 training rounds. Weak classifiers are stumps in the linear 2D space. Positive 
examples are marked as ‘+’, ‘o’ are the negative ones, and ‘1’ denotes the first selected weak 
classifier. Positives are the costly class. 

Strictly following Viola and Jones, from Figure 1(a) we could reach the 
apparent conclusion that, once an initial asymmetric weak classifier has been 
selected, the selection of the remaining weak classifiers is not guided by an 
asymmetric goal. However, as shown by Schapire and Singer [5], AdaBoost is 
an additive minimization process and, as such, it has an asymptotic behavior 
that cannot be properly judged by stopping after only a few training rounds. 
Running the algorithm for many more rounds on the same example (see Figure 
1(b)), we observe that many of the subsequently selected classifiers are at 
least as asymmetric as the first one. 

The class-conditional interpretation of AdaBoost underlying Cost-Generalized 
AdaBoost shows that the asymmetry encoded by the initial weight distribution is 
actually translated into a cost-sensitive global error (a weighted error), and 
that what AdaBoost actually minimizes is a bound on that global error. Thus, 
instead of inspecting the individual asymmetry of each single hypothesis, the 
cost-sensitive behavior of AdaBoost should be evaluated in terms of the 
cumulative contribution of all the selected weak classifiers giving rise to the 
strong one. Figure 1(c) shows how, even in a scenario like the one proposed by 
Viola and Jones [13], the classifier obtained by AdaBoost after an asymmetric 
weight initialization follows a genuinely cost-sensitive iterative profile. 

Moreover, the postulates by Viola and Jones [13] and Masnadi-Shirazi and 
Vasconcelos [10] can also be refuted by simply inverting the labels on the same 
set (see Figure 2). As can be seen, no weak classifier is able to satisfy, by 
itself, the requirements of that “supposed” initial round absorbing the full 
asymmetry of the problem. However, even in such an unfavorable scenario, the 
desired asymmetry is effectively achieved, from cost-proportionate initial 
weights, through a cumulative round-by-round (boosted) process. 

Further comments on these experiments can be found in Appendix C. 
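The cumulative behavior discussed above can be illustrated with a minimal sketch. It is not the experiment of Figures 1 and 2: the function names, the one-dimensional toy data used in the usage note and the exhaustive stump pool are assumptions made only for illustration. The code runs unmodified AdaBoost from cost-proportionate initial weights and records, round by round, the cost-sensitive global error, i.e. the sum of the initial weights of the examples misclassified by the strong classifier built so far.

```python
import math


def stump_predict(x, feature, threshold, polarity):
    """Decision stump: returns polarity if x[feature] > threshold, else -polarity."""
    return polarity if x[feature] > threshold else -polarity


def cost_generalized_adaboost(X, y, c_pos, c_neg, rounds):
    """Plain AdaBoost; the only asymmetric ingredient is the initial weighting."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    total = c_pos + c_neg
    # Two uniform class-conditional distributions scaled by the normalized costs.
    initial = [c_pos / (total * n_pos) if label == 1 else c_neg / (total * n_neg)
               for label in y]
    weights = list(initial)
    # Hypothetical weak-learner pool: every feature, every observed threshold
    # (plus -inf, which yields constant classifiers), both polarities.
    candidates = [(j, thr, pol)
                  for j in range(len(X[0]))
                  for thr in [-math.inf] + sorted({x[j] for x in X})
                  for pol in (1, -1)]
    ensemble, scores, history = [], [0.0] * len(y), []
    for _ in range(rounds):
        best_err, best = None, None
        for stump in candidates:
            err = sum(w for w, x, label in zip(weights, X, y)
                      if stump_predict(x, *stump) != label)
            if best_err is None or err < best_err:
                best_err, best = err, stump
        err = min(max(best_err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, best))
        # Unchanged AdaBoost weight update, followed by normalization.
        weights = [w * math.exp(-alpha * label * stump_predict(x, *best))
                   for w, x, label in zip(weights, X, y)]
        z = sum(weights)
        weights = [w / z for w in weights]
        scores = [s + alpha * stump_predict(x, *best) for s, x in zip(scores, X)]
        # Cost-sensitive global error of the strong classifier after this round.
        history.append(sum(w0 for w0, s, label in zip(initial, scores, y)
                           if s * label <= 0))
    return ensemble, history
```

As a usage example, with `X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]`, `y = [1, 1, -1, -1, 1, 1]`, `c_pos = 4.0` and `c_neg = 1.0`, no single stump separates the classes, so the quantity of interest is the evolution of `history` over many rounds rather than the asymmetry of the first selected stump.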

4.1.2. Weight Initialization inside the Cost-Sensitive Boosting Framework 

Cost-Sensitive AdaBoost [10] is an algorithm that, despite having a rigorous 
theoretical derivation, is built upon the belief that cost-sensitive initial weighting 
is not a valid method to achieve asymmetric boosted classifiers. However, as we 
have already mentioned, the theoretical analysis behind Cost-Generalized 
AdaBoost refutes that supposed invalidity. A clarifying experiment at this point 
is to introduce asymmetric weight initialization inside the Cost-Sensitive 
AdaBoost theoretical framework, to assess the theoretical validity of the former 
with the tools used by the latter. 

Figure 2: Synthetic counterexample to the example by Viola and Jones [13], with costs $C_P = 4$ 
and $C_N = 1$, and with opposite polarity to the original: (a) training set with the first four 
weak classifiers superimposed; (b) weak classifiers after 50 training rounds; (c) global error 
evolution through 50 training rounds. Weak classifiers are stumps in the linear 2D space. 
Positive examples are marked as ‘+’, ‘o’ are the negative ones, and ‘1’ denotes the first 
selected weak classifier. Positives are the costly class. 

Based on the Statistical View of Boosting [7], the cost-sensitive expected 
loss (i.e. the risk function) proposed by Masnadi-Shirazi and Vasconcelos to 
derive Cost-Sensitive AdaBoost consists of two class-dependent exponential 
components with the asymmetry embedded in their exponents: 

\[ J_{CSA}(f(\mathbf{x})) = \mathrm{E}\left[ [\![ y = 1 ]\!] \exp(-C_P f(\mathbf{x})) + [\![ y = -1 ]\!] \exp(C_N f(\mathbf{x})) \right] \]

Following the same proof derivation scheme, if the derivative of this loss 
is set to zero, we obtain the function of minimum expected loss (minimum 
risk) conditioned on $\mathbf{x}$ for Cost-Sensitive AdaBoost which, as can be seen, is 
based on the asymmetric logistic transform of $P(y = 1 | \mathbf{x})$. 

\[ J_{CSA}(f(\mathbf{x})) = P(y = 1 | \mathbf{x}) \exp(-C_P f(\mathbf{x})) + P(y = -1 | \mathbf{x}) \exp(C_N f(\mathbf{x})) \]

\[ \frac{\partial J_{CSA}(f(\mathbf{x}))}{\partial f(\mathbf{x})} = -C_P P(y = 1 | \mathbf{x}) \exp(-C_P f(\mathbf{x})) + C_N P(y = -1 | \mathbf{x}) \exp(C_N f(\mathbf{x})) = 0 \]

\[ \frac{C_P P(y = 1 | \mathbf{x})}{C_N P(y = -1 | \mathbf{x})} = \exp\left((C_P + C_N) f(\mathbf{x})\right) \]

\[ f_{CSA}(\mathbf{x}) = \frac{1}{C_P + C_N} \log\left( \frac{C_P P(y = 1 | \mathbf{x})}{C_N P(y = -1 | \mathbf{x})} \right) \]


Now, let us suppose that the two cost parameters $C_P$ and $C_N$, rather than 
in the exponents, are incorporated as direct modulators of the exponentials 
(15). This procedure is equivalent to modeling the initial weight distribution by 
means of two uniform class-conditional distributions, respectively modulated by 
$C_P / (C_P + C_N)$ and $C_N / (C_P + C_N)$, i.e. an asymmetric weight initialization 
like the one giving rise to Cost-Generalized AdaBoost. 

\[ J_{CGA}(f(\mathbf{x})) = \mathrm{E}\left[ [\![ y = 1 ]\!] C_P \exp(-f(\mathbf{x})) + [\![ y = -1 ]\!] C_N \exp(f(\mathbf{x})) \right] \tag{15} \]

If we repeat the above derivation scheme on this new loss, we find the 
function of minimum expected loss (minimum risk) conditioned on $\mathbf{x}$ for 
Cost-Generalized AdaBoost: 

\[ J_{CGA}(f(\mathbf{x})) = P(y = 1 | \mathbf{x}) C_P \exp(-f(\mathbf{x})) + P(y = -1 | \mathbf{x}) C_N \exp(f(\mathbf{x})) \]

\[ \frac{\partial J_{CGA}(f(\mathbf{x}))}{\partial f(\mathbf{x})} = -P(y = 1 | \mathbf{x}) C_P \exp(-f(\mathbf{x})) + P(y = -1 | \mathbf{x}) C_N \exp(f(\mathbf{x})) = 0 \]

\[ f_{CGA}(\mathbf{x}) = \frac{1}{2} \log\left( \frac{C_P P(y = 1 | \mathbf{x})}{C_N P(y = -1 | \mathbf{x})} \right) \]

As can be seen, the obtained minimizer is also based on the asymmetric logistic 
transform of $P(y = 1 | \mathbf{x})$, showing that, even from the Cost-Sensitive 
AdaBoost derivation perspective, there is no reason to discard asymmetric weight 
initialization as a valid approach to build cost-sensitive boosted classifiers^ 
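The two closed-form minimizers derived in this section can be checked numerically. The following sketch (function names and the search interval are illustrative assumptions) evaluates both conditional risks for a fixed likelihood `p` standing for $P(y = 1 | \mathbf{x})$, and compares the closed-form expressions with the result of a direct one-dimensional minimization.

```python
import math


def risk_csa(f, p, c_pos, c_neg):
    """Conditional risk of Cost-Sensitive AdaBoost (costs inside the exponents)."""
    return p * math.exp(-c_pos * f) + (1 - p) * math.exp(c_neg * f)


def risk_cga(f, p, c_pos, c_neg):
    """Conditional risk of Cost-Generalized AdaBoost (costs as modulators)."""
    return p * c_pos * math.exp(-f) + (1 - p) * c_neg * math.exp(f)


def minimizer_csa(p, c_pos, c_neg):
    return math.log(c_pos * p / (c_neg * (1 - p))) / (c_pos + c_neg)


def minimizer_cga(p, c_pos, c_neg):
    return 0.5 * math.log(c_pos * p / (c_neg * (1 - p)))


def numeric_argmin(risk, lo=-10.0, hi=10.0, iters=200):
    """Ternary search; valid here because both risks are strictly convex in f."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if risk(a) < risk(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2
```

For instance, `numeric_argmin(lambda f: risk_cga(f, 0.3, 4.0, 1.0))` should agree with `minimizer_cga(0.3, 4.0, 1.0)` up to the numerical tolerance of the search, and likewise for the Cost-Sensitive AdaBoost pair.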


4.2. Comparative Analysis of the Theoretical Approaches 

As we have seen, among the three asymmetric AdaBoost algorithms with 
a full theoretical derivation, two of them (Cost-Sensitive AdaBoost and 
AdaBoostDB) lead to the same solution, while the other one (Cost-Generalized 
AdaBoost) has been shown to offer the same theoretical validity as its 
counterparts. At this point, we may wonder whether Cost-Generalized AdaBoost 
also obtains the same solution as Cost-Sensitive AdaBoost/AdaBoostDB. As 
we will see in the experimental part of our work (in the second part of this series 
of two papers [53]), the answer to this question is “no”: classifiers obtained by 
Cost-Sensitive AdaBoost and Cost-Generalized AdaBoost in the same scenarios 
are markedly different. In this section, from a theoretical perspective, we will 
analyze the differences between the two algorithms, with the aim of identifying 
the intrinsic distinctive features of their respective classifiers. 

^As analyzed in the appendices, the way asymmetry is applied across the different boosting 
variants covered by the Cost-Sensitive Boosting framework [10] is not homogeneous either. In 
fact, despite having discarded cost-proportionate weight initialization as a valid method, one 
of the algorithms (Cost-Sensitive LogitBoost) proposed in the same work is actually based on 
that strategy. 


4.2.1. Error Bound Minimization 

As commented earlier, the most common detection problem can be 
parametrized by the following cost matrix: 

\[ C = \begin{pmatrix} c_{nn} & c_{np} \\ c_{pn} & c_{pp} \end{pmatrix} \]

We will start our comparative analysis by following the error bound 
minimization perspective originally proposed by Schapire and Singer [5], also used 
in the derivation of Cost-Generalized AdaBoost and AdaBoostDB. From that 
point of view, classical AdaBoost, with its initial uniform weight distribution, 
is an algorithm driven to minimize an exponential bound ($\tilde{E}_T$) on the training 
error ($E_T$) (16), as illustrated in Figure 3. In that figure, the horizontal axis 
represents the performance score $y_i f(\mathbf{x}_i)$ of a classification, whose sign 
indicates the success (if $y_i f(\mathbf{x}_i) > 0$) or failure (if $y_i f(\mathbf{x}_i) < 0$) of the decision, and 
whose magnitude indicates the confidence of the classifier in its decision. 
The exponential bound decreases for increasing performance scores, 
so the classical AdaBoost minimization process aims to maximize correct 
classifications and their margin (distance to the boundary), in a scenario where 
all the training examples follow a common cost scheme. 

\[ E_T = \sum_{i=1}^{n} \frac{1}{n} [\![ H(\mathbf{x}_i) \neq y_i ]\!] \leq \sum_{i=1}^{n} \frac{1}{n} \exp(-y_i f(\mathbf{x}_i)) = \tilde{E}_T \tag{16} \]
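The relationship in (16) between the training error and its exponential bound can be verified on any set of performance scores. The sketch below assumes that `scores[i]` holds $y_i f(\mathbf{x}_i)$ and, as a convention chosen here, counts a zero score as an error; the function names are illustrative.

```python
import math


def training_error(scores):
    """Fraction of examples whose performance score y_i * f(x_i) is not positive."""
    return sum(1 for s in scores if s <= 0) / len(scores)


def exponential_bound(scores):
    """Average exponential loss, which upper-bounds the training error."""
    return sum(math.exp(-s) for s in scores) / len(scores)
```

Since $\exp(-s) \geq 1$ whenever $s \leq 0$ and $\exp(-s) > 0$ otherwise, `exponential_bound(scores)` can never fall below `training_error(scores)`, and it keeps decreasing as correctly classified examples gain margin.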

Cost-Sensitive AdaBoost and AdaBoostDB, assuming that the training 
set is divided into two significant subsets (positives and negatives), define two 
different exponential bounds ($\tilde{E}_{TP}$ and $\tilde{E}_{TN}$) with different associated costs ($C_P$ 
and $C_N$) over each subset. These costs are inserted as exponent modulators into 
each class-dependent exponential bound (17), reaching a cost-sensitive behavior 
that can be graphically interpreted as shown in Figure 4. The goal is, again, 
to maximize correct classifications and their margin, but this time in a scenario 
where positives and negatives have different associated losses. 

Figure 3: Training error bound of AdaBoost. The loss (y-axis) associated with each decision has 
an exponential dependency on the performance score of the strong classifier (x-axis). 

\[ E_T = \sum_{i=1}^{n} \frac{1}{n} [\![ H(\mathbf{x}_i) \neq y_i ]\!] \leq \sum_{i=1}^{m} \frac{1}{n} \exp(-C_P y_i f(\mathbf{x}_i)) + \sum_{i=m+1}^{n} \frac{1}{n} \exp(-C_N y_i f(\mathbf{x}_i)) = \tilde{E}_{TP} + \tilde{E}_{TN} = \tilde{E}_T \tag{17} \]


As can be seen, asymmetric modifications in Cost-Sensitive AdaBoost (and 
AdaBoostDB) are based on new bounds for the training error, while the error 
definition itself remains unchanged from original (cost-insensitive) AdaBoost. 

Cost-Generalized AdaBoost, on the other hand, is based on redefining the 
training error and then applying the standard exponential bounding process. 
To achieve this, the training errors on positives ($E_{TP}$) and on negatives ($E_{TN}$) are 
computed separately, and then modulated by their respective normalized 
costs. The resulting class-dependent weighted error components ($E'_{TP}$ and 
$E'_{TN}$) jointly define the cost-sensitive global training error ($E'_T$). In the same 
way as in standard AdaBoost, each of these weighted error components can be 
exponentially bounded ($\tilde{E}'_{TP}$ and $\tilde{E}'_{TN}$), and the combination of the two 
resulting class-dependent bounds defines a cost-sensitive global bound ($\tilde{E}'_T$) 
(18), which is the function minimized by Cost-Generalized AdaBoost. The 
scenario is graphically depicted in Figure 5. 

Figure 4: Training error bound of Cost-Sensitive AdaBoost and AdaBoostDB for $C_P = 2$ and 
$C_N = 1$. The loss has a class-dependent definition and is composed of two different exponential 
functions. 

\[
\begin{aligned}
E'_T &= E'_{TP} + E'_{TN} = \frac{C_P}{C_P + C_N} E_{TP} + \frac{C_N}{C_P + C_N} E_{TN} \\
&= \frac{C_P}{C_P + C_N} \sum_{i=1}^{m} \frac{1}{m} [\![ H(\mathbf{x}_i) \neq y_i ]\!] + \frac{C_N}{C_P + C_N} \sum_{i=m+1}^{n} \frac{1}{n-m} [\![ H(\mathbf{x}_i) \neq y_i ]\!] \\
&\leq \frac{C_P}{C_P + C_N} \sum_{i=1}^{m} \frac{1}{m} \exp(-y_i f(\mathbf{x}_i)) + \frac{C_N}{C_P + C_N} \sum_{i=m+1}^{n} \frac{1}{n-m} \exp(-y_i f(\mathbf{x}_i)) \\
&= \tilde{E}'_{TP} + \tilde{E}'_{TN} = \tilde{E}'_T
\end{aligned}
\tag{18}
\]
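The construction in (18) can be sketched in a few lines. Assuming per-class lists of performance scores $y_i f(\mathbf{x}_i)$ (the function name and the convention of counting a zero score as an error are illustrative choices), the code computes the cost-sensitive global error and its exponential bound, each as a cost-weighted combination of class-conditional averages.

```python
import math


def cga_error_and_bound(pos_scores, neg_scores, c_pos, c_neg):
    """Return (cost-sensitive global error, exponential bound) as in (18)."""
    w_pos = c_pos / (c_pos + c_neg)
    w_neg = c_neg / (c_pos + c_neg)
    # Class-conditional training errors (uniform within each class).
    err_pos = sum(1 for s in pos_scores if s <= 0) / len(pos_scores)
    err_neg = sum(1 for s in neg_scores if s <= 0) / len(neg_scores)
    # Class-conditional exponential bounds.
    bound_pos = sum(math.exp(-s) for s in pos_scores) / len(pos_scores)
    bound_neg = sum(math.exp(-s) for s in neg_scores) / len(neg_scores)
    error = w_pos * err_pos + w_neg * err_neg
    bound = w_pos * bound_pos + w_neg * bound_neg
    return error, bound
```

Note that the error itself, not only its bound, depends on the costs: the same set of mistakes yields a larger global error when it falls on the costly class.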


It is important to notice that, by definition, all these algorithms have the 
goal of obtaining the best possible classifier able to deal with the problem in a 
cost-sensitive sense, and that the bounding loss functions $\tilde{E}_T$ are a mere 
mathematical tool to make the minimization problem tractable. Thus, from a formal 
point of view, the direct definition of a cost-sensitive error to be subsequently 
bounded, as proposed by Cost-Generalized AdaBoost, seems more suitable 
than using the standard cost-insensitive error and manipulating its bound to be 
asymmetric, as suggested by Cost-Sensitive AdaBoost or AdaBoostDB. 

Figure 5: Training error bound of Cost-Generalized AdaBoost for $C_P = 2$ and $C_N = 1$. The loss 
again keeps an exponential dependency, but now modulated by a class-dependent behavior. 

Figure 6 illustrates the prevalence of the class-dependent error bounds of 
the two algorithms, assuming, without loss of generality, that positives have a 
greater cost than negatives, $C_P > C_N$ (the opposite case can be modeled by 
a simple label swap). As can be seen, in Cost-Generalized AdaBoost (Figure 
6(a)) the loss associated with positives is always greater than the loss associated with 
negatives, and the ratio between the two class-dependent losses remains constant 
along the performance scores. However, in Cost-Sensitive AdaBoost (Figure 
6(b)) the ratio between losses varies according to the score, to the extent that 
class prevalence is inverted depending on which side of the success boundary 
($y_i f(\mathbf{x}_i) = 0$) we are. 
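The contrast between both prevalence profiles follows directly from the ratio of the class-dependent losses at a common performance score $s = y_i f(\mathbf{x}_i)$. A minimal sketch (function names are illustrative; the common normalization factor of Cost-Generalized AdaBoost cancels in the ratio):

```python
import math


def loss_ratio_cga(score, c_pos, c_neg):
    """Positive-to-negative loss ratio for Cost-Generalized AdaBoost."""
    return (c_pos * math.exp(-score)) / (c_neg * math.exp(-score))


def loss_ratio_csa(score, c_pos, c_neg):
    """Positive-to-negative loss ratio for Cost-Sensitive AdaBoost."""
    return math.exp(-c_pos * score) / math.exp(-c_neg * score)
```

With $C_P = 2$ and $C_N = 1$, `loss_ratio_cga` stays at 2 for every score, whereas `loss_ratio_csa` equals $\exp(-(C_P - C_N) s)$: above 1 for negative scores (mistakes), exactly 1 at the boundary, and below 1 for positive scores (correct decisions).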

The iterative learning process behind AdaBoost builds a predictor function 
$f(\mathbf{x}_i)$ aimed at progressively (round by round) minimizing the respective loss 
function over the training dataset. In terms of classification, this means that 
AdaBoost classifiers are trained not only to maximize the accuracy of the 
classifier over the training set, but also to maximize the margin of its decisions. So, 
once a training example is correctly classified, the tendency of the learner will 
be to keep increasing the confidence of its prediction ($|f(\mathbf{x}_i)|$) to move it 
away from the decision boundary ($f(\mathbf{x}_i) = 0$). For Cost-Generalized AdaBoost, 
this means that any positive training example will always be more costly (and 
in the same ratio) than any negative example with the same performance score, 
whatever this score is. However, in the case of Cost-Sensitive AdaBoost, the 
prevalence ratio varies exponentially with the performance score. So, when scores are 
positive, negative training examples become the prevalent ones. 

Figure 6: Class prevalence of error bounds for Cost-Generalized AdaBoost (a) and 
Cost-Sensitive AdaBoost (b) ($C_P = 2$, $C_N = 1$). 

Bearing in mind that the performance score of any training example, at any 
iteration of the learning process, is determined by evaluating on that example 
the boosted predictor learned so far, and that the weight of this example 
for the next learning round will depend on the value of the related bounding 
loss for that particular score, we can draw the following two consequences: 

• In Cost-Generalized AdaBoost positives will always be the costly class, 
and the same cost asymmetry is preserved throughout the whole learning 
process. 

• In Cost-Sensitive AdaBoost cost asymmetry changes. While the classifier 
is wrong, positives are the costly class (learning is positive-driven), but 
when classification is correct, negatives are prevalent (learning is 
negative-driven). The more accurate the classifier obtained is, the more costly 
negatives will be with respect to positives in subsequent training rounds. 

In terms of training error, these differences may seem anecdotal, since the 
change of class prevalence occurs once the classifier succeeds on each example. 
However, what is really relevant is the effect in terms of generalization error: 
when the classifier works on unseen instances it will make mistakes, and it is 
essential, from a cost-sensitive perspective, to characterize which class is the 
most prone to errors and to what extent. 

As the iterative training process progresses, the performance scores 
associated with the training examples tend to increase, and their respective losses tend 
to decrease along the Y axis of the error bound plots (Figures 4, 5 and 6); so, the 
more rounds we train, the further to the right of these figures we will be. In the case of 
Cost-Sensitive AdaBoost this trend will increasingly emphasize negatives at the 
expense of positives, while Cost-Generalized AdaBoost keeps the ratio between 
classes intact throughout the whole learning process. Thus, due to its changing 
emphasis, Cost-Sensitive AdaBoost may run the risk of obtaining classifiers in 
which the supposedly costly class is the most prone to errors: just the opposite 
of what was originally intended! 

In the companion paper of the series [33] we will see empirical evidence 
confirming this asymmetry swapping behavior which, by definition, is expected 
to be more noticeable the closer the system is to overfitting, but which may 
have an implicit detrimental effect on the performance reached by all classifiers 
trained by Cost-Sensitive AdaBoost. 

4.2.2. Statistical View of Boosting 

Instead of the exponential error bound minimization perspective that 
originally gave rise to AdaBoost (and that is also the derivation core of 
Cost-Generalized AdaBoost and AdaBoostDB), we will now adopt a different point 
of view: the Statistical View of Boosting [7], the other major analytical 
framework to interpret and derive AdaBoost which, in addition, is the foundation of 
Cost-Sensitive AdaBoost. 

As we have seen in Section 2.2, from the Statistical View of Boosting 
perspective, AdaBoost can be interpreted as an algorithm that iteratively builds 
an additive regression model based on the following loss function: 

\[ l_{AB}(f(\mathbf{x}), y) = \exp(-y f(\mathbf{x})) \]


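From that loss, an associated risk function $J_{AB}(f(\mathbf{x}))$ (the expected loss) is 
defined: 

\[ J_{AB}(f(\mathbf{x})) = \mathrm{E}\left[ l_{AB}(f(\mathbf{x}), y) \right] = P(y = 1 | \mathbf{x}) \exp(-f(\mathbf{x})) + P(y = -1 | \mathbf{x}) \exp(f(\mathbf{x})) \]

If we minimize that risk we obtain the optimal predictor $f_{AB}(\mathbf{x})$, which 
turns out to be the symmetric logistic transform of $P(y = 1 | \mathbf{x})$. 

\[ f_{AB}(\mathbf{x}) = \frac{1}{2} \log\left( \frac{P(y = 1 | \mathbf{x})}{P(y = -1 | \mathbf{x})} \right) \]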


AdaBoost is geared to approximate, in an additive way, that optimal 
predictor without embedded costs. Thus, the obtained model will be cost-insensitive, 
depending only on the likelihood of each class (see Figure 7). 
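For completeness, the minimization step leading to the symmetric logistic transform can be spelled out as follows. This is a short worked derivation under the shorthand $p = P(y = 1 | \mathbf{x})$ and $f = f(\mathbf{x})$, which is introduced here only for compactness.

```latex
\begin{aligned}
J_{AB}(f) &= p\, e^{-f} + (1 - p)\, e^{f}, \\
\frac{\partial J_{AB}}{\partial f} &= -p\, e^{-f} + (1 - p)\, e^{f} = 0
  \;\Longrightarrow\; e^{2f} = \frac{p}{1 - p}
  \;\Longrightarrow\; f_{AB} = \frac{1}{2} \log \frac{p}{1 - p}, \\
\frac{\partial^2 J_{AB}}{\partial f^2} &= p\, e^{-f} + (1 - p)\, e^{f} > 0 .
\end{aligned}
```

The positive second derivative confirms that the stationary point is the unique minimum; the same three steps, with the costs placed either as modulators or inside the exponents, yield the cost-sensitive minimizers of Section 4.1.2.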

In the case of Cost-Generalized AdaBoost, from this same perspective, we 
have a loss function in which costs are included as modulators of the 
exponentials: 

\[ l_{CGA}(f(\mathbf{x}), y) = [\![ y = 1 ]\!] C_P \exp(-f(\mathbf{x})) + [\![ y = -1 ]\!] C_N \exp(f(\mathbf{x})) \]

Thus, as explained in Section 4.1.2, the respective risk function $J_{CGA}(f(\mathbf{x}))$ 
and its minimizer $f_{CGA}(\mathbf{x})$ are the following: 

\[ J_{CGA}(f(\mathbf{x})) = \mathrm{E}\left[ l_{CGA}(f(\mathbf{x}), y) \right] = P(y = 1 | \mathbf{x}) C_P \exp(-f(\mathbf{x})) + P(y = -1 | \mathbf{x}) C_N \exp(f(\mathbf{x})) \]

\[ f_{CGA}(\mathbf{x}) = \frac{1}{2} \log\left( \frac{C_P P(y = 1 | \mathbf{x})}{C_N P(y = -1 | \mathbf{x})} \right) \]

Figure 7: Risk-minimizing function (optimal predictor) for AdaBoost ($f_{AB}(\mathbf{x})$). It 
depends only on the likelihood of each class. 

As can be seen, we now have a cost-sensitive risk function with a 
cost-sensitive minimizer, leading to an optimal predictor $f_{CGA}(\mathbf{x})$ based on the 
asymmetric logistic transform of $P(y = 1 | \mathbf{x})$. Thus, in contrast to AdaBoost, the 
model pursued by Cost-Generalized AdaBoost will depend not only on 
the likelihood of each class, but also on the related costs. 

On the other hand, the loss function of Cost-Sensitive AdaBoost embeds the 
costs inside the exponents, 

\[ l_{CSA}(f(\mathbf{x}), y) = [\![ y = 1 ]\!] \exp(-C_P f(\mathbf{x})) + [\![ y = -1 ]\!] \exp(C_N f(\mathbf{x})) \]

so the risk function and its associated minimizer are as follows (see Section 
4.1.2): 

\[ J_{CSA}(f(\mathbf{x})) = \mathrm{E}\left[ l_{CSA}(f(\mathbf{x}), y) \right] = P(y = 1 | \mathbf{x}) \exp(-C_P f(\mathbf{x})) + P(y = -1 | \mathbf{x}) \exp(C_N f(\mathbf{x})) \]

\[ f_{CSA}(\mathbf{x}) = \frac{1}{C_P + C_N} \log\left( \frac{C_P P(y = 1 | \mathbf{x})}{C_N P(y = -1 | \mathbf{x})} \right) \]

Figure 8: Risk-minimizing function (optimal predictor) for Cost-Generalized AdaBoost 
($f_{CGA}(\mathbf{x})$). It depends on the likelihood of each class and on the related costs, having a 
homogeneous and continuous cost-sensitive behavior for whatever likelihood. 

Thus, Cost-Sensitive AdaBoost also aims to fit a model based on the 
asymmetric logistic transform of $P(y = 1 | \mathbf{x})$, depending both on the likelihood 
of each class and on the related costs (see Figure 9). 

Nevertheless, the optimal predictors guiding Cost-Sensitive AdaBoost 
and Cost-Generalized AdaBoost, despite both being cost-sensitive, have 
different equations. Such differences become apparent in their graphic representations 
(see Figures 8 and 9). 

To delve into the consequences of these differences, we will analyze the 
optimal predictors of Cost-Generalized AdaBoost and Cost-Sensitive AdaBoost 
as functions depending on two magnitudes: likelihood and cost asymmetry^. 
In Figure 10 we have represented the outputs of the optimal predictors as 
colormaps (we have used isolines for the sake of clarity) over the plane defined by 
the likelihood and the cost asymmetry. As can be seen, the optimal predictor 
of Cost-Generalized AdaBoost (Figure 10(a)) yields higher predictor values for 
increasing $P(y = 1 | \mathbf{x})$ and increasing $C_P$ (and vice versa for negatives). However, 
that is not the case for Cost-Sensitive AdaBoost (Figure 10(b)) where, for a given 
likelihood, we can find lower predictor outputs for increasing positive costs (and 
vice versa for negatives). This inhomogeneous behavior can explain the 
asymmetry swapping effect discussed in Section 4.2.1, to which we 
will come back in the companion paper of the series [33] when analyzing the 
experimental behavior of Cost-Sensitive AdaBoost. 

^In the case of Cost-Sensitive AdaBoost (and AdaBoostDB) we can actually distinguish 
three different magnitudes involved (likelihood, cost of positives and cost of negatives), since 
the optimal predictor changes when costs are multiplied by a positive factor. This behavior 
(which does not occur for Cost-Generalized AdaBoost) violates the rules of the cost matrix 
explained earlier. In order to tackle this problem in our analysis, 
we have restricted the possible costs to combinations ($C_P$, $C_N$) in which one of the coefficients 
is always 1, and the other one is > 1. This decision allows us to homogeneously interpret the 
scenarios in which negatives are the costliest class as label inversions. 

Figure 9: Risk-minimizing function (optimal predictor) for Cost-Sensitive AdaBoost 
($f_{CSA}(\mathbf{x})$). It depends on the likelihood of each class and on the related costs, but in this 
case the cost-sensitive behavior is not homogeneous with respect to likelihood (solutions for 
different costs cross each other depending on $P(y = 1 | \mathbf{x})$). 

Figure 10: Isolines of the optimal predictors for Cost-Generalized AdaBoost (a), and 
Cost-Sensitive AdaBoost (b), with respect to the likelihood ($P(y = 1 | \mathbf{x})$) and the normalized cost 
asymmetry ($\gamma = C_P / (C_P + C_N)$). 
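The two properties discussed in this section, the inhomogeneous response of the Cost-Sensitive AdaBoost predictor to increasing costs and its dependence on the scale of the costs, can be checked directly from the closed-form minimizers. The sketch below uses illustrative function names and takes `p` as $P(y = 1 | \mathbf{x})$.

```python
import math


def f_cga(p, c_pos, c_neg):
    """Optimal predictor of Cost-Generalized AdaBoost."""
    return 0.5 * math.log(c_pos * p / (c_neg * (1 - p)))


def f_csa(p, c_pos, c_neg):
    """Optimal predictor of Cost-Sensitive AdaBoost."""
    return math.log(c_pos * p / (c_neg * (1 - p))) / (c_pos + c_neg)


def is_increasing(values):
    """True when each value is strictly greater than the previous one."""
    return all(a < b for a, b in zip(values, values[1:]))
```

For a likelihood of 0.9 and $C_N = 1$, evaluating both predictors at $C_P = 1, 2, 4$ gives an increasing sequence for `f_cga` and a decreasing one for `f_csa`. Likewise, multiplying both costs by 10 leaves `f_cga` unchanged but divides `f_csa` by 10.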


5. Summary and Conclusions 

In this first paper of the series we have introduced our working scenario, 
presenting the algorithms under study (AdaBoost with threshold modification 
[5]; AsymBoost [13]; AdaCost; CSB0, CSB1 and CSB2; AdaC1, 
AdaC2 and AdaC3; Cost-Sensitive AdaBoost [10]; AdaBoostDB [35]; 
and Cost-Generalized AdaBoost) in a homogeneous notational framework 
and proposing a clustering scheme for them based on the way asymmetry is 
inserted in the learning process: theoretically, heuristically or a posteriori. Then, 
for those algorithms with a fully theoretical derivation, we performed a thorough 
theoretical analysis and discussion, adopting the different perspectives that have 
been used to explain and derive the related approaches in the literature (Error 
Bound Minimization perspective [5] and Statistical View of Boosting [7]). 

The presented analysis clearly shows that the asymmetric weight 
initialization mechanism used by Cost-Generalized AdaBoost is, from whatever point of 
view, a valid mechanism to build theoretically sound cost-sensitive 
boosted classifiers, despite having been recurrently overlooked or rejected in 
many previous works. In addition, besides being 
the simplest algorithm, Cost-Generalized AdaBoost exhibits the most 
consistent error bound definition and is able to preserve the class-dependent loss 
ratio regardless of the training round, whereas Cost-Sensitive AdaBoost and 
AdaBoostDB, the other theoretical alternatives, may end up emphasizing the least 
costly class. 

After this purely theoretical study, an empirical analysis of the different 
approaches, also including the non-fully-theoretical methods (a posteriori and 
heuristic), is needed to reach global conclusions and complete the analysis we 
have started in this paper. That experimental part can be found in the next 
article of the series: “Untangling AdaBoost-based Cost-Sensitive Classification 
Part II: Empirical Analysis” [55]. 
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Appendix A Unifying Compendium of Algorithms 


Next, a compendium of all the algorithms studied in this series of papers is 
presented, following a homogeneous and unifying notation to facilitate their 
comparative analysis. Asterisks indicate changes with regard to standard 
AdaBoost. 


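Algorithm 1 AdaBoost 

Input: 

Training set of $n$ examples: $(\mathbf{x}_i, y_i)$, where $y_i = 1$ if $1 \leq i \leq m$ (Positives), 
and $y_i = -1$ if $m < i \leq n$ (Negatives). 

Pool of $F$ weak classifiers: $h_f(\mathbf{x})$ 

Number of rounds: $T$ 

Initial (uniform) weight distribution: $D(i) = 1/n$ 

Iterate: 

for $t = 1$ to $T$ do 
    for $f = 1$ to $F$ do 
        Compute the weighted error of the weak classifier, $\epsilon_{t,f} = \sum_{i=1}^{n} D(i) [\![ h_f(\mathbf{x}_i) \neq y_i ]\!]$ 
    end for 

    Select the weak learner $h_t(\mathbf{x})$ of smallest error, with $\epsilon_t = \min_f [\epsilon_{t,f}]$ 

    Compute the goodness parameter of the selected classifier, $\alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ 

    Update weights: $D(i) \leftarrow \frac{D(i) \exp(-\alpha_t y_i h_t(\mathbf{x}_i))}{\sum_{i=1}^{n} D(i) \exp(-\alpha_t y_i h_t(\mathbf{x}_i))}$ 
end for 

Final Classifier: 

$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$ 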
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Final Classifier: 

H{x) = sign (x;Ll 
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Final Classifier: 

$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$ 
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Algorithm 4 CSBO 

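Input: 

Training set of $n$ examples: $(\mathbf{x}_i, y_i)$, where $y_i = 1$ if $1 \leq i \leq m$ (Positives), 
and $y_i = -1$ if $m < i \leq n$ (Negatives). 

Pool of $F$ weak classifiers: $h_f(\mathbf{x})$ 

Number of rounds: $T$ 

* Cost parameters: $C_P, C_N \in \mathbb{R}^{+}$ 

Initialize: 

* Weight Distribution: $D(i) = \frac{C_P}{m C_P + (n - m) C_N}$ if $1 \leq i \leq m$, 
  and $D(i) = \frac{C_N}{m C_P + (n - m) C_N}$ if $m < i \leq n$. 

Iterate: 

for $t = 1$ to $T$ do 
    for $f = 1$ to $F$ do 
        Compute the weighted error of the weak classifier, $\epsilon_{t,f} = \sum_{i=1}^{n} D(i) [\![ h_f(\mathbf{x}_i) \neq y_i ]\!]$ 
    end for 

    Select the weak learner $h_t(\mathbf{x})$ of smallest error, with $\epsilon_t = \min_f [\epsilon_{t,f}]$ 

    Compute the goodness parameter of the selected classifier, $\alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ 

    * Update and normalize weights: 
        $D(i) \leftarrow D(i)$ if $h_t(\mathbf{x}_i) = y_i$, 
        $D(i) \leftarrow C_P D(i)$ if $h_t(\mathbf{x}_i) \neq y_i$ and $y_i = 1$, 
        $D(i) \leftarrow C_N D(i)$ if $h_t(\mathbf{x}_i) \neq y_i$ and $y_i = -1$, 
        $D(i) \leftarrow \frac{D(i)}{\sum_{i=1}^{n} D(i)}$ 
end for 

Final Classifier: 

* $H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}) \left( C_P [\![ h_t(\mathbf{x}) = +1 ]\!] + C_N [\![ h_t(\mathbf{x}) = -1 ]\!] \right)\right)$ 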
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Final Classifier: 

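* $H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}) \left( C_P [\![ h_t(\mathbf{x}) = +1 ]\!] + C_N [\![ h_t(\mathbf{x}) = -1 ]\!] \right)\right)$ 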
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Final Classifier: 

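* $H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}) \left( C_P [\![ h_t(\mathbf{x}) = +1 ]\!] + C_N [\![ h_t(\mathbf{x}) = -1 ]\!] \right)\right)$ 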
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Final Classifier: 

$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$ 
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Final Classifier: 

$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$ 
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Final Classifier: 

$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$ 
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Algorithm 10 Cost-Sensitive AdaBoost 

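Input: 

Training set of $n$ examples: $(\mathbf{x}_i, y_i)$, where $y_i = 1$ if $1 \leq i \leq m$, 
and $y_i = -1$ if $m < i \leq n$. 

Pool of $F$ weak classifiers: $h_f(\mathbf{x})$ 

* Cost parameters: $C_P, C_N \in \mathbb{R}^{+}$ 

Number of rounds: $T$ 

Initial (uniform) weight distribution: $D(i)$ 

for $t = 1$ to $T$ do 

    * Calculate parameters: 
        $T_P = \sum_{i=1}^{m} D(i)$ 
        $T_N = \sum_{i=m+1}^{n} D(i)$ 

    for $f = 1$ to $F$ do 

        Pick up weak classifier: $h_f(\mathbf{x})$. 

        * Calculate parameters: 
            $B = \sum_{i=1}^{m} D(i) [\![ y_i \neq h_f(\mathbf{x}_i) ]\!]$ 
            $D = \sum_{i=m+1}^{n} D(i) [\![ y_i \neq h_f(\mathbf{x}_i) ]\!]$ 

        * Find $\alpha_{t,f}$ solving the next hyperbolic equation: 
            $2 C_P B \cosh(C_P \alpha_{t,f}) + 2 C_N D \cosh(C_N \alpha_{t,f}) = C_P T_P e^{-C_P \alpha_{t,f}} + C_N T_N e^{-C_N \alpha_{t,f}}$ 

        * Compute the loss of the weak learner: 
            $L_{t,f} = B \left( e^{C_P \alpha_{t,f}} - e^{-C_P \alpha_{t,f}} \right) + T_P e^{-C_P \alpha_{t,f}} + D \left( e^{C_N \alpha_{t,f}} - e^{-C_N \alpha_{t,f}} \right) + T_N e^{-C_N \alpha_{t,f}}$ 

    end for 

    Select the weak learner $(h_t(\mathbf{x}), \alpha_t)$ of smallest loss in this round: $\arg\min_f [L_{t,f}]$ 

    * Update weights: 
        $D(i) \leftarrow D(i) \exp(-C_P \alpha_t h_t(\mathbf{x}_i))$ if $1 \leq i \leq m$, 
        $D(i) \leftarrow D(i) \exp(C_N \alpha_t h_t(\mathbf{x}_i))$ if $m < i \leq n$. 

end for 

Final Classifier: 

$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$ 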
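The hyperbolic equation in the inner loop has no closed-form solution in general, so each candidate weak classifier requires a scalar root search. The following is a minimal sketch of one possible way to do it, by bisection; the function names, the bracketing strategy and the iteration count are assumptions made for illustration and are not taken from the original algorithm description. The arguments `b`, `d`, `t_pos` and `t_neg` stand for $B$, $D$, $T_P$ and $T_N$.

```python
import math


def solve_alpha(b, d, t_pos, t_neg, c_pos, c_neg, iters=200):
    """Bisection on g(alpha) = LHS - RHS of the hyperbolic equation (alpha > 0)."""
    def g(alpha):
        lhs = (2 * c_pos * b * math.cosh(c_pos * alpha)
               + 2 * c_neg * d * math.cosh(c_neg * alpha))
        rhs = (c_pos * t_pos * math.exp(-c_pos * alpha)
               + c_neg * t_neg * math.exp(-c_neg * alpha))
        return lhs - rhs

    if b + d <= 0:
        raise ValueError("a weak classifier without errors has no finite root")
    if g(0.0) >= 0:
        raise ValueError("no positive root: the weak classifier is too weak")
    # g is increasing for alpha >= 0, so doubling hi eventually brackets the root.
    lo, hi = 0.0, 1.0
    while g(hi) < 0:
        hi *= 2
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def weak_loss(alpha, b, d, t_pos, t_neg, c_pos, c_neg):
    """Loss of a weak learner for a given alpha, as in the listing above."""
    return (b * (math.exp(c_pos * alpha) - math.exp(-c_pos * alpha))
            + t_pos * math.exp(-c_pos * alpha)
            + d * (math.exp(c_neg * alpha) - math.exp(-c_neg * alpha))
            + t_neg * math.exp(-c_neg * alpha))
```

As a sanity check, with $C_P = C_N = 1$ the equation reduces to the standard AdaBoost case, so the root should coincide with $\frac{1}{2} \log((1 - \epsilon)/\epsilon)$ for $\epsilon = B + D$ when $T_P + T_N = 1$.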
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Algorithm 11 AdaBoostDB 

Input: 


Training set of n examples: (xi,yi), where yi — 

Pool of F weak classifiers: /i/(x) 

* Cost parameters: Cp, Cn ^ 

Number of rounds: T 

Initial weight distribution: D{i) 


1 if 1 < i < m, 
— 1 if m < f < n. 


Dp(i) = 2#^ 

Weight subdistributions: \ 

1 

-k Accumulators: Ap — 1, An — 1- 


if m < f < n. 


for f = 1 to T do 


k Minimum root: r — 1 


k Minimum root vector: r— (2,2) 
Scalar product: s = 1 


k Update accumulators: 


Ap -ir- Ap Dp{i), 

An An -Div(i). 


-*• Normalize weight subdistributions: 


Dp(i) 

-PjvCO 

^r=m+l ■ 


k Calculate static parameters: 


for / = 1 to F do 


^ _ CpAp 

CpAp+CN^N ’ 

u _ 

Cp Ap -\-CN N 


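    • Calculate variable parameters:
        $\varepsilon_{P,f} = \sum_{i=1}^{m} D_P(i) \llbracket y_i \neq h_f(x_i) \rrbracket$
        $\varepsilon_{N,f} = \sum_{i=m+1}^{n} D_N(i) \llbracket y_i \neq h_f(x_i) \rrbracket$
    • Calculate current classifier vector: $\vec{c} = (a \cdot \varepsilon_{P,f},\; b \cdot \varepsilon_{N,f})$
    • CONDITIONAL SEARCH
      if $a \cdot \varepsilon_{P,f} + b \cdot \varepsilon_{N,f} < \frac{1}{2}$ [Contribution Condition] then
        if $\vec{c} \cdot \vec{r} > s$ [Improvement Condition] then
          Search the only real and positive root r of the polynomial:
            $(a \cdot \varepsilon_{P,f})\, x^{2C_P} + (b \cdot \varepsilon_{N,f})\, x^{C_P + C_N} + b(\varepsilon_{N,f} - 1)\, x^{C_P - C_N} + a(\varepsilon_{P,f} - 1) = 0$
          Update parameters:
            $\vec{r} = \left( r^{2C_P} + 1,\; r^{C_P + C_N} + r^{C_P - C_N} \right)$, $s = a + b\, r^{C_P - C_N}$
          Keep $h_f(x)$ as round t solution.
        end if
      end if
  end for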


  • Calculate goodness parameter: $\alpha_t = \log\left(\frac{1}{r}\right)$


  • Update weight subdistributions:
      $D_P(i) \leftarrow D_P(i) \exp(-C_P \alpha_t h_t(x_i))$
      $D_N(i) \leftarrow D_N(i) \exp(C_N \alpha_t h_t(x_i))$
end for

Final Classifier:
  $H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$
Appendix B Other Cost Scenarios


As we have analyzed so far, the most common detection problem is that in
which cost coefficients are null for correct decisions ($c_{nn} = c_{pp} = 0$) but
non-zero for mistakes ($c_{np}, c_{pn} > 0$). Thus, we have distinguished between
two “usual” scenarios:

• Cost-Insensitive (Symmetry): Regardless of the class, all mistakes have
the same cost, $c_{np} = c_{pn}$ (Figure 11a).

• Cost-Sensitive (Error Asymmetry): Mistakes in positives are costlier than
mistakes in negatives, $c_{np} > c_{pn}$ (Figure 11b).

However, the “reasonableness conditions” ($c_{nn} < c_{np}$ and $c_{pp} < c_{pn}$) [15]
discussed in Section ?? still allow other possible scenarios depending on how
costs are defined:

• Correct Classification Asymmetry: All mistakes have the same cost,
$c_{np} = c_{pn}$, but correct decisions on positives are costlier than on
negatives, $c_{pp} > c_{nn}$ (Figure 11c).

• Dual Asymmetry: Correct and wrong decisions on positives are costlier
than correct and wrong decisions on negatives respectively, $c_{pp} > c_{nn}$ and
$c_{np} > c_{pn}$ (Figure 11d).

• Reversed Dual Asymmetry: Mistakes on positives are costlier than on
negatives, $c_{np} > c_{pn}$, while correct decisions are costlier on negatives
than on positives, $c_{nn} > c_{pp}$ (Figure 11e).

In all these cases we have supposed, without loss of generality, that the cost
of mistakes in positives is always greater than or equal to the cost of mistakes
in negatives ($c_{np} \ge c_{pn} > 0$), since the opposite case can be modeled just by
swapping labels.
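
The taxonomy above can be written down as a small decision rule. The Python sketch below is only illustrative: the function name `cost_scenario` is hypothetical, and it assumes that the Symmetry and Error Asymmetry labels also apply whenever both correct-decision costs are equal (the text only discusses the case in which both are zero).

```python
def cost_scenario(c_nn, c_np, c_pn, c_pp):
    """Name the cost scenario of a 2x2 cost matrix (illustrative helper).

    Argument names follow the text: c_nn and c_pp are the costs of correct
    decisions, c_np and c_pn the costs of mistakes, with c_np >= c_pn.
    """
    if not (c_nn < c_np and c_pp < c_pn):
        raise ValueError("reasonableness conditions are violated")
    if c_np < c_pn:
        raise ValueError("swap the labels so that c_np >= c_pn")
    same_error_cost = (c_np == c_pn)
    if c_pp == c_nn:
        # Assumption: equal (not necessarily null) costs for correct decisions.
        return "Symmetry" if same_error_cost else "Error Asymmetry"
    if c_pp > c_nn:
        return ("Correct Classification Asymmetry" if same_error_cost
                else "Dual Asymmetry")
    # Remaining case: correct decisions are costlier on negatives (c_nn > c_pp).
    # With equal error costs this combination is not among the listed scenarios.
    return None if same_error_cost else "Reversed Dual Asymmetry"

if __name__ == "__main__":
    print(cost_scenario(0.0, 2.0, 1.0, 0.0))
    print(cost_scenario(0.5, 2.0, 1.0, 0.0))
```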

Figure 11: Different cost scenarios [Symmetry (a), Error Asymmetry (b), Correct
Classification Asymmetry (c), Dual Asymmetry (d) and Reversed Dual Asymmetry
(e)] with their corresponding cost matrices.

As we have seen, classical AdaBoost, with its standard exponential bound, is
aimed at the cost-insensitive case (Figure 12a), while Cost-Generalized
AdaBoost seems to be better suited than Cost-Sensitive AdaBoost for the
standard cost-sensitive (error asymmetric) scenario (Figure 12b). But what
would happen in the other asymmetric scenarios? Which of the two theoretical
variants is best suited for each case?

• Correct Classification Asymmetry: The positive class is prevalent for
positive performance scores while no class is prevalent for negative ones.
Cost-Generalized AdaBoost, with a smoothed asymmetry, seems to be the most
suitable scheme (Figure 12c).

• Dual Asymmetry: The positive class is hegemonic throughout the whole
performance score space, and the only difference of being on either side
of the success boundary is the cost ratio between the two classes.
Cost-Generalized AdaBoost, in this case with a more pronounced asymmetry,
may be the most appropriate scheme (Figure 12d).

• Reversed Dual Asymmetry: The costlier class changes depending on being
mistaken or not. In this case, Cost-Sensitive AdaBoost, taking advantage
of its class-prevalence reversal, seems to be the most suitable model for
the problem (Figure 12e).

Up to this point we have assumed that cost coefficients are constant, so all the 
examples belonging to the same class have the same associated cost. However, 
a cost matrix with variable coefficients would entail that different examples of 
the same class may have different costs. In general terms, cost requirements of 
any classification problem can be split into two levels: a Class-Level regarding 
the cost ratio between classes (the global emphasis given to each class), and an 
Example-Level regarding the cost distribution within a given class. When cost 
coefficients are constant, class-level is the only kind of asymmetry involved in 
the problem, but when costs are variable both levels can be present. 

As analyzed in [?], this asymmetry breakdown can be immediately mapped
to the Cost-Generalized AdaBoost framework by simply defining a set of
parameters from the initial weight distribution $D(i)$ given to the algorithm:
Figure 12: Boosting models applied to the cost scenarios in Figure 11 [AdaBoost for Symmetry
(a), Cost-Generalized AdaBoost for Error Asymmetry (b), Cost-Generalized AdaBoost for
Correct Classification Asymmetry (c), Cost-Generalized AdaBoost for Dual Asymmetry (d)
and Cost-Sensitive AdaBoost for Reversed Dual Asymmetry (e)].
• Asymmetry:

$$\sum_{i=1}^{m} D(i) = \gamma \in (0, 1)$$

• Class-conditional weight distributions:

$$D_P(i) = \frac{D(i)}{\gamma}, \quad \text{for } i = 1, \ldots, m$$

$$D_N(i) = \frac{D(i)}{1 - \gamma}, \quad \text{for } i = m + 1, \ldots, n$$


While $\gamma$ quantifies the class-level global asymmetry of the problem, the
class-conditional weight distributions $D_P(i)$ and $D_N(i)$ describe how some
examples are emphasized within each class (example-level). Hence, the initial
weights $D(i)$ couple both cost levels and determine the specific exponential
bound applied to each example throughout the minimization procedure (19). As
a result, by simply defining a proper weight initialization scheme,
Cost-Generalized AdaBoost is able to model any class-level or example-level
asymmetric cost scenario (without class prevalence reversal), and it also
preserves the same computational complexity for all cases. Figure 13a
illustrates the effect of mapping variable cost coefficients to different
weights in Cost-Generalized AdaBoost.

$$E_T = \sum_{i=1}^{n} D(i) \llbracket H(x_i) \neq y_i \rrbracket \le \sum_{i=1}^{n} D(i) \exp\left(-y_i f(x_i)\right) = \tilde{E}_T \qquad (19)$$
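
The breakdown into $\gamma$, $D_P(i)$ and $D_N(i)$, together with bound (19), can be checked numerically. The Python sketch below is a minimal illustration: the helper names (`initial_weights`, `asymmetry_breakdown`, `error_and_bound`) and the idea of building $D(i)$ proportionally to a per-example cost are assumptions made here, and the scores and costs in the usage example are invented.

```python
import math

def initial_weights(costs_pos, costs_neg):
    """Hypothetical cost-proportionate initialization: D(i) proportional to
    the cost assigned to example i, positives first, normalized to sum to 1."""
    raw = list(costs_pos) + list(costs_neg)
    total = sum(raw)
    return [c / total for c in raw]

def asymmetry_breakdown(D, m):
    """Split D(i) into the class-level asymmetry gamma and the
    class-conditional distributions D_P(i) and D_N(i)."""
    gamma = sum(D[:m])           # total weight of the m positive examples
    neg_mass = sum(D[m:])        # equals 1 - gamma when D sums to 1
    D_P = [d / gamma for d in D[:m]]
    D_N = [d / neg_mass for d in D[m:]]
    return gamma, D_P, D_N

def error_and_bound(D, y, f):
    """Weighted error E_T and its exponential bound, as in (19).
    Scores f(x_i) >= 0 are read as positive decisions."""
    decisions = [1 if fi >= 0 else -1 for fi in f]
    error = sum(d for d, yi, hi in zip(D, y, decisions) if hi != yi)
    bound = sum(d * math.exp(-yi * fi) for d, yi, fi in zip(D, y, f))
    return error, bound

if __name__ == "__main__":
    D = initial_weights([2, 2, 4], [1, 1])        # three positives, two negatives
    print(asymmetry_breakdown(D, m=3))
    print(error_and_bound(D, [1, 1, 1, -1, -1], [0.5, -0.2, 1.0, -0.3, 0.1]))
```

In this toy setting the positives carry more total weight than the negatives (class-level asymmetry), while the third positive is emphasized within its class (example-level asymmetry).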

In case of having variable cost coefficients and class prevalence reversal
(Reversed Dual Asymmetry), a modification of Cost-Sensitive AdaBoost or
AdaBoostDB is needed. Such a variation would require distinct exponent
modulators for each cost, so the resulting global error bound would be
modeled (20) as a sequence of exponential factors with different bases related
to each different cost. However, minimization of this bound becomes
increasingly complex as the number of distinct “discrete costs” in the
training set grows, to the point that the process may end up being unfeasible:
as we will see in the accompanying paper of the series [33], just passing from
one base (Standard AdaBoost and Cost-Generalized AdaBoost) to two different
bases (Cost-Sensitive AdaBoost/AdaBoostDB with constant coefficients) implies
that training time becomes 18 times longer, on average, even using the most
efficient of the two alternatives*. The graphical behavior of Cost-Sensitive
AdaBoost when variable cost coefficients are mapped to different exponential
bases can be visualized in Figure 13b.

Figure 13: Variable cost coefficients mapped to different initial weights in Cost-Generalized
AdaBoost (a), and to different exponential bases in Cost-Sensitive AdaBoost (b).



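$$E_T = \frac{1}{n} \sum_{i=1}^{m_1} \llbracket H(x_i) \neq y_i \rrbracket + \frac{1}{n} \sum_{i=m_1+1}^{m_2} \llbracket H(x_i) \neq y_i \rrbracket + \ldots + \frac{1}{n} \sum_{i=m_k+1}^{n} \llbracket H(x_i) \neq y_i \rrbracket$$

$$\le \frac{1}{n} \sum_{i=1}^{m_1} \beta_1^{-y_i f(x_i)} + \frac{1}{n} \sum_{i=m_1+1}^{m_2} \beta_2^{-y_i f(x_i)} + \ldots + \frac{1}{n} \sum_{i=m_k+1}^{n} \beta_{k+1}^{-y_i f(x_i)} = \tilde{E}_T \qquad (20)$$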


Table 1 summarizes all these conclusions. As can be seen, Cost-Generalized
AdaBoost dominates most of the possible cost scenarios, including the most
common ones (symmetry and standard asymmetry)†; it is able to model both
constant and variable cost coefficients, and it also preserves the same
computational complexity for all cases.


* As shown in [25], AdaBoostDB is, on average, 200 times faster than
Cost-Sensitive AdaBoost.

† As analyzed in [?], AdaBoost and Cost-Generalized AdaBoost can actually be
considered as one algorithm, preserving both theoretical properties and
computational complexity for any cost requirements.


                                         Level of Asymmetry
Scenario              Cost Matrix C      Class-Level                            Example-Level
--------------------  -----------------  -------------------------------------  ----------------------------------------------
Symmetry              (0, 1; 1, 0)       AdaBoost (Cost-Generalized AdaBoost)   Cost-Generalized AdaBoost
Standard Asymmetry    (0, >1; 1, 0)      Cost-Generalized AdaBoost              Cost-Generalized AdaBoost
Unbalanced Asymmetry  (0, >1; 1, >0)     Cost-Generalized AdaBoost              Cost-Generalized AdaBoost
Reversed Asymmetry    (>0, >1; 1, 0)     Cost-Sensitive AdaBoost                Cost-Sensitive AdaBoost (increasingly complex)

Table 1: Summary of the proposed mapping between AdaBoost algorithms and the different
asymmetric scenarios.
Appendix C Comments on Figures ?? and ??

If we analyze Figures ?? and ??, we will see that, before the error profiles
start to evolve iteratively, both classifiers go through an initial period of
“flat” performance progress.

• In the case of Figure ??, the classifier obtained after the first round
yields null classification error in positives and 0.7 error in negatives,
maintaining the same performance until round 6.

• In Figure ??, the first-round classifier gets null error in positives and
total error in negatives (it is an “all-positives” classifier), keeping this
global performance unchanged until round 9.

What is happening during these seemingly “stubborn” rounds? Does an
“all-positives” or an “all-negatives” weak classifier make sense inside the
AdaBoost framework?

In the first round the weak learner selects the best weak classifier for the
initial weight distribution. Bearing in mind that the initial weights define the
overall desired asymmetry for the whole problem [?], the first learning round is
actually searching for a weak classifier “as if” it were going to be the only
one in the ensemble. Depending on the topology of the classes, the pool of weak
classifiers and the desired asymmetry, it could happen that the best single weak
classifier dealing with the problem is an “all-positives” or an “all-negatives”
one. As we can easily see in Figure ??, due to the spatial distribution of both
classes, their relative costs and the weak classifiers we have used (stumps in
the linear bidimensional space), the best possible single weak classifier is
just an “all-positives” one. This effect can be interpreted as a global
asymmetry adjustment, fitting the a priori probability of each class and
smoothing the weight distribution, so that the subsequent training rounds can
select new weak classifiers to jointly build a more accurate strong
classification. Note that the classifiers depicted in Figures ?? and ?? are the
first four weak classifiers obtained by AdaBoost excluding the “all-positives”
or “all-negatives” weak-classifier rounds.
Also depending on the topology of the problem and the desired asymmetry,
once the first round is trained, it is possible that the best weak classifier
of the next iteration does not have “enough goodness” ($\alpha_t$) to change the
decision boundaries of the strong classifier. In that case, even though the
predictor will evolve according to the incorporated weak hypothesis, no changes
in the performance of the strong classifier will be perceived. This situation
may be repeated for several iterations, and it is responsible for the
aforementioned initial flat sections in Figures ?? and ??. The key is that no
single weak classifier has enough complexity to change the strong classifier by
itself, so several consecutive weak hypotheses must be gathered to jointly
achieve enough performance to change the decision boundaries.
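
A toy computation makes this “flat section” effect concrete. In the Python sketch below, the goodness values and weak-classifier outputs are invented for illustration (they are not taken from the experiments), and `strong_decisions` is a hypothetical helper name. The second weak hypothesis alone cannot outweigh the first-round “all-positives” classifier, but the second and third together can.

```python
def strong_decisions(alphas, H):
    """Decisions of the strong classifier sign(sum_t alpha_t * h_t(x_i)).

    H[t][i] in {-1, +1} is the output of weak classifier t on example i.
    A zero score is read as a positive decision.
    """
    n = len(H[0])
    scores = [sum(a * h[i] for a, h in zip(alphas, H)) for i in range(n)]
    return [1 if s >= 0 else -1 for s in scores]

# Illustrative values only: an "all-positives" first round followed by two
# weak hypotheses whose individual goodness is smaller than the first one.
H = [[+1, +1, +1, +1],
     [-1, -1, +1, +1],
     [-1, +1, -1, +1]]
alphas = [1.0, 0.4, 0.7]

after_round_1 = strong_decisions(alphas[:1], H[:1])
after_round_2 = strong_decisions(alphas[:2], H[:2])   # same decisions: flat round
after_round_3 = strong_decisions(alphas, H)           # first example flips

if __name__ == "__main__":
    print(after_round_1, after_round_2, after_round_3)
```

The scores change in every round, yet the decisions only change once the accumulated weight of the later hypotheses exceeds that of the first one on some example.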

These two phenomena (“global asymmetry adjustment rounds” and “flat
sections”) are not exclusive to the first rounds, and can be found, with the
same meaning, at any other point of the training process
(e.g. rounds 13 to 18 in Figure ??).
Appendix D Cost-Sensitive Boosting Extensions and Weight Initialization

Original (Discrete) AdaBoost can be generalized to deal with weak hypotheses
that, instead of being binary $\{-1, 1\}$, are defined over a continuous range in
$\mathbb{R}$ [?]. This extension of the algorithm is usually known as Real AdaBoost [7],
and it is based on the minimization of the same exponential loss as in the
discrete case. As a consequence, for cost-sensitive purposes, the same loss
modification and weight initialization strategies used in Cost-Generalized
AdaBoost [?] can be applied, preserving again all the theoretical guarantees of
the symmetric Real AdaBoost version.

Besides Cost-Sensitive AdaBoost, a Cost-Sensitive RealBoost version is also
proposed in [10]. For its derivation, the authors use the same exponential loss
(21) as in the discrete case, with asymmetry embedded in the exponents, so the
cost-proportionate weight initialization procedure (linked to a direct
modulation of the exponential components) is again discarded.

$$J(f(x)) = E\left[ \llbracket y = 1 \rrbracket e^{-C_1 f(x)} + \llbracket y = -1 \rrbracket e^{C_2 f(x)} \right] \qquad (21)$$

The Cost-Sensitive Boosting framework also includes a third algorithm:
Cost-Sensitive LogitBoost. Unlike the previous cases, the original
(cost-insensitive) LogitBoost algorithm [7] is not based on minimizing an
exponential loss, but on maximizing a Bernoulli log-likelihood (22):

$$l(f(x)) = E\left[ y' \log(p(x)) + (1 - y') \log(1 - p(x)) \right] \qquad (22)$$

where

$$y' = \frac{y + 1}{2}, \qquad P(y' = 1 | x) = p(x) = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}}$$

Then, to derive their Cost-Sensitive LogitBoost proposal, Masnadi-Shirazi
and Vasconcelos define the probability of $y' = 1$ as follows:

$$p(x) = \frac{e^{\gamma f(x) + \eta}}{e^{\gamma f(x) + \eta} + e^{-\gamma f(x) - \eta}}, \qquad \gamma = \frac{C_1 + C_2}{2}, \qquad \eta = \frac{1}{2} \log\frac{C_2}{C_1}$$

If we normalize cost factors by $2/(C_1 + C_2)$ in the expressions above (note
that the only relevant issue is the cost proportion, not their actual value), it
is easy to see that such a definition of $p(x)$ is, in fact, equivalent to (23):

$$p(x) = \frac{C_2 e^{f(x)}}{C_2 e^{f(x)} + C_1 e^{-f(x)}} \qquad (23)$$
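
The equivalence can be verified numerically. The Python sketch below assumes the Cost-Sensitive LogitBoost probability has the form $p(x) = e^{\gamma f + \eta} / (e^{\gamma f + \eta} + e^{-\gamma f - \eta})$ with $\gamma = (C_1 + C_2)/2$ and $\eta = \frac{1}{2}\log(C_2/C_1)$, and compares it with the modulated form (23) once the costs are scaled by $2/(C_1 + C_2)$. All function names are hypothetical and the cost values are arbitrary.

```python
import math

def p_cs_logitboost(f, C1, C2):
    """Assumed Cost-Sensitive LogitBoost probability of y' = 1."""
    gamma = (C1 + C2) / 2.0
    eta = 0.5 * math.log(C2 / C1)
    z = gamma * f + eta
    return math.exp(z) / (math.exp(z) + math.exp(-z))

def p_modulated(f, C1, C2):
    """Probability written as a direct modulation of the exponentials (23)."""
    return C2 * math.exp(f) / (C2 * math.exp(f) + C1 * math.exp(-f))

def normalize_costs(C1, C2):
    """Scale both costs by 2 / (C1 + C2); the cost ratio is unchanged."""
    scale = 2.0 / (C1 + C2)
    return C1 * scale, C2 * scale

if __name__ == "__main__":
    C1, C2 = normalize_costs(5.0, 1.0)
    for f in (-2.0, -0.5, 0.0, 0.7, 3.0):
        print(f, p_cs_logitboost(f, C1, C2), p_modulated(f, C1, C2))
```

After normalization $\gamma = 1$, so $e^{f + \eta}$ becomes $\sqrt{C_2/C_1}\, e^{f}$ and multiplying numerator and denominator by $\sqrt{C_1 C_2}$ yields (23) directly.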

As a result, the optimization strategy inside Cost-Sensitive LogitBoost is
actually based on embedding asymmetry by modulation of the exponential terms.
Curiously, such an approach is just the same mechanism that, in the same work
[10], was rejected for AdaBoost and RealBoost. Moreover, this strategy
eventually becomes a different initial weight distribution for the first
weighted least-squares regression iteration of LogitBoost, in clear analogy with
the mechanism on which Cost-Generalized AdaBoost [?] is based.